Statistical Data Mining 1 PowerPoint PPT Presentation

Title: Statistical Data Mining 1


1
Statistical Data Mining - 1
  • Edward J. Wegman

A Short Course for Interface 01
2
Outline of Lecture
  • Complexity
  • Data Mining: What is it?
  • Data Preparation

3
Complexity
  • Descriptor   Data Set Size in Bytes   Storage Mode
  • Tiny         10^2                     Piece of Paper
  • Small        10^4                     A Few Pieces of Paper
  • Medium       10^6                     A Floppy Disk
  • Large        10^8                     Hard Disk
  • Huge         10^10                    Multiple Hard Disks, e.g. RAID Storage
  • Massive      10^12                    Robotic Magnetic Tape Storage Silos
  • The Huber Taxonomy of Data Set Sizes

4
Complexity
  • O(r), O(n^(1/2)): Plot a scatterplot
  • O(n): Calculate means, variances, kernel density
    estimates
  • O(n log(n)): Calculate fast Fourier transforms
  • O(nc): Calculate singular value decomposition of
    an r × c matrix; solve a multiple linear
    regression
  • O(n^2): Solve most clustering algorithms
  • Algorithmic Complexity
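
To get a feel for how these complexity classes interact with the Huber data set sizes, the following minimal sketch estimates run times for each combination. It assumes a hypothetical machine that performs 10^9 basic operations per second, treats the data set size in bytes as n, and uses a base-2 logarithm; none of these choices come from the slides. O(nc) is left out because it needs a value for c.

```python
import math

# Data set sizes from the Huber taxonomy (bytes, treated here as n).
SIZES = {"Tiny": 1e2, "Small": 1e4, "Medium": 1e6,
         "Large": 1e8, "Huge": 1e10, "Massive": 1e12}

# Complexity classes from the slide; O(nc) is omitted because it needs c.
CLASSES = {
    "n^(1/2)": lambda n: math.sqrt(n),
    "n": lambda n: n,
    "n log n": lambda n: n * math.log2(n),  # base 2 is an assumption
    "n^2": lambda n: n ** 2,
}

OPS_PER_SECOND = 1e9  # hypothetical machine speed, an assumption


def seconds_needed(n, cost):
    """Estimated run time if each basic operation takes 1/OPS_PER_SECOND s."""
    return cost(n) / OPS_PER_SECOND


for name, n in SIZES.items():
    row = ", ".join(f"{label}: {seconds_needed(n, f):.3g} s"
                    for label, f in CLASSES.items())
    print(f"{name:8s} {row}")
```

Under these assumptions an O(n^2) algorithm on a Massive data set needs about 10^15 seconds, which is why quadratic methods such as most clustering algorithms stop being feasible long before that size.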

5
Complexity
6
Complexity
7
Complexity
8
Complexity
9
Complexity
10
Complexity
11
Complexity
12
Complexity
13
Complexity
14
Complexity
  • Scenarios
  • Typical high resolution workstations,
  • 1280 × 1024 = 1.31 × 10^6 pixels
  • Realistic using Wegman, immersion, 4:5 aspect
    ratio,
  • 2333 × 1866 = 4.35 × 10^6 pixels
  • Very optimistic using 1 minute of arc, immersion,
    4:5 aspect ratio,
  • 8400 × 6720 = 5.65 × 10^7 pixels
  • Wildly optimistic using Maar(2), immersion, 4:5
    aspect ratio,
  • 17,284 × 13,828 = 2.39 × 10^8 pixels
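
The pixel totals in these scenarios are simple products of width and height. The short check below recomputes them and prints the height-to-width ratio of each display; the dictionary keys are shortened labels chosen for this sketch.

```python
# Resolutions (width, height) quoted on the slide.
SCENARIOS = {
    "Typical workstation": (1280, 1024),
    "Realistic": (2333, 1866),
    "Very optimistic": (8400, 6720),
    "Wildly optimistic": (17284, 13828),
}


def pixel_count(width, height):
    """Total number of pixels on a width x height display."""
    return width * height


for name, (w, h) in SCENARIOS.items():
    print(f"{name:20s} {w} x {h} = {pixel_count(w, h):.3g} pixels, "
          f"height/width = {h / w:.3f}")
```

Even the most optimistic scenario displays only about 2.4 × 10^8 pixels, so a data set in the Huge or Massive range cannot be shown one observation per pixel.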

15
Massive Data Sets
  • One Terabyte Dataset
  • vs
  • One Million Megabyte Data Sets
  • Both difficult to analyze,
  • but for different reasons.

16
Massive Data Sets: Commonly Used Language
  • Data Mining (DM)
  • Knowledge Discovery in Databases (KDD)
  • Massive Data Sets (MD)
  • Data Analysis (DA)

17
Massive Data Sets
18
Data Mining of Massive Datasets
  • Data Mining is Exploratory Data Analysis with
    Little or No Human Interaction using
    Computationally Feasible Techniques,
  • i.e., the Attempt to find Interesting Structure
    unknown a priori

19
Statistical Data Mining
  • Techniques
  • - Classification
  • - Clustering
  • - Neural Networks and Genetic Algorithms
  • - CART
  • - Nonparametric Regression
  • - Time Series: Trend and Spectral Estimation
  • - Density Estimation, Bumps and Ridges

20
Massive Data Sets
  • Major Issues
  • Complexity
  • Non-homogeneity
  • Examples
  • Huber's Air Traffic Control
  • Highway Maintenance
  • Ultrasonic NDE

21
Massive Data Sets
  • Air Traffic Control
  • 6 to 12 Radar stations, several hundred aircraft,
    64-byte record per radar per aircraft per antenna
    turn
  • megabyte of data per minute
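
The "megabyte per minute" figure can be sanity-checked with a rough calculation. The sketch below takes the 64-byte record and the 6 to 12 radar stations from the slide, but the aircraft count (300) and the antenna rate (5 turns per minute) are illustrative assumptions, not values from the talk.

```python
RECORD_BYTES = 64  # record size per radar per aircraft per antenna turn (from the slide)


def bytes_per_minute(radars, aircraft, turns_per_minute):
    """Data volume per minute if every radar logs every aircraft on every turn."""
    return radars * aircraft * RECORD_BYTES * turns_per_minute


# Assumed values: 300 aircraft and 5 antenna turns per minute are illustrative.
for radars in (6, 12):
    rate = bytes_per_minute(radars, aircraft=300, turns_per_minute=5)
    print(f"{radars} radars: {rate / 1e6:.2f} MB per minute")
```

With these assumed inputs the rate lands at roughly half a megabyte to just over one megabyte per minute, which is consistent in order of magnitude with the slide.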

22
Massive Data Sets
  • Highway Maintenance
  • Maintenance records and measurements of road
    quality for several decades
  • Records of uneven quality
  • Records missing

23
Massive Data Sets
  • NDE using Ultrasound
  • Inspection of cast iron projectiles
  • Time series of length 256, 360 degrees, 550
    levels: 256 × 360 × 550 = 50,688,000 observations
    per projectile
  • Several thousand projectiles per day
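
The observation count per projectile is the product of the three dimensions on the slide. The sketch below recomputes it and scales it to a daily volume; the figure of 2000 projectiles per day is an assumed stand-in for "several thousand".

```python
SERIES_LENGTH = 256   # samples per time series
ANGLES = 360          # one series per degree of rotation
LEVELS = 550          # inspection levels along the projectile

observations_per_projectile = SERIES_LENGTH * ANGLES * LEVELS


def observations_per_day(projectiles_per_day):
    """Total observations generated per day at a given inspection rate."""
    return observations_per_projectile * projectiles_per_day


print(observations_per_projectile)
# 2000 projectiles per day is an assumed figure, not from the slide.
print(f"{observations_per_day(2000):.3g}")
```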

24
Massive Data Sets: A Distinction
  • Human Analysis of the Structure of Data and
    Pitfalls
  • vs
  • Human Analysis of the Data Itself
  • Limits of the human visual system (HVS) and
    computational complexity limit the latter
  • The former is the basis for design of the
    analysis engine

25
Massive Data Sets
  • Data Types
  • Experimental
  • Observational
  • Opportunistic
  • Data Types
  • Numerical
  • Categorical
  • Image

26
Data Preparation
27
Data Preparation
  • [Figure: bar chart of Effort (%), on a scale of 0
    to 60, for Objectives Determination, Data
    Preparation, Data Mining, and Analysis and
    Assimilation]
28
Data Preparation
  • Data Cleaning and Quality
  • Types of Data
  • Categorical versus Continuous Data
  • Problem of Missing Data
  • Imputation
  • Missing Data Plots
  • Problem of Outliers
  • Dimension Reduction, Quantization, Sampling

29
Data Preparation
  • Quality
  • Data may not have any statistically significant
    patterns or relationships
  • Results may be inconsistent with other data sets
  • Data often of uneven quality, e.g. made up by
    respondent
  • Opportunistically collected data may have biases
    or errors
  • Discovered patterns may be too specific or too
    general to be useful

30
Data Preparation
  • Noise - Incorrect Values
  • Faulty data collection instruments, e.g. sensors
  • Transmission errors, e.g. intermittent errors
    from satellite or Internet transmissions
  • Data entry problems
  • Technology limitations
  • Naming conventions misused

31
Data Preparation
  • Noise - Incorrect Classification
  • Human judgment
  • Time varying
  • Uncertainty/Probabilistic nature of data

32
Data Preparation
  • Redundant/Stale data
  • Variables have different names in different
    databases
  • Raw variable in one database is a derived
    variable in another
  • Irrelevant variables destroy speed (dimension
    reduction needed)
  • Changes in variable over time not reflected in
    database

33
Data Preparation
  • Data cleaning
  • Selecting an appropriate data set and/or
    sampling strategy
  • Transformations

34
Data Preparation
  • Data Cleaning
  • Duplicate removal (tool based)
  • Missing value imputation (manual, statistical)
  • Identify and remove data inconsistencies
  • Identify and refresh stale data
  • Create unique record (case) ID
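
Three of the cleaning steps listed above (duplicate removal, missing value imputation, and creating a unique case ID) can be sketched in a few lines. This is only an illustration on made-up records: the field names are hypothetical, and mean imputation is just one simple statistical choice that carries its own model assumptions.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"name": "A", "age": 34, "income": 52000},
    {"name": "B", "age": None, "income": 61000},
    {"name": "A", "age": 34, "income": 52000},   # exact duplicate
    {"name": "C", "age": 45, "income": None},
]


def remove_duplicates(rows):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_mean(rows, field):
    """Replace missing values of a numeric field with the observed mean."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def add_case_ids(rows):
    """Prepend a sequential, unique case ID to every record."""
    return [{"case_id": i, **r} for i, r in enumerate(rows, start=1)]


clean = remove_duplicates(records)
for field in ("age", "income"):
    clean = impute_mean(clean, field)
clean = add_case_ids(clean)
for row in clean:
    print(row)
```

Assigning the case ID after deduplication keeps the IDs contiguous; assigning it before would preserve a link back to the raw rows instead.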

35
Data Preparation
  • Categorical versus Continuous Data
  • Most statistical theory, many graphics tools
    developed for continuous data
  • Much, if not most, of the data in databases is
    categorical
  • The computer science view often converts
    continuous data into categorical data, e.g.
    salaries categorized as low, medium, high,
    because categories are better suited to Boolean
    operations
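
The salary example can be sketched as a simple binning function. The cutoff values below are hypothetical and chosen only for illustration.

```python
def categorize_salary(salary, low_cutoff=40000, high_cutoff=90000):
    """Map a continuous salary to low / medium / high (cutoffs are hypothetical)."""
    if salary < low_cutoff:
        return "low"
    if salary < high_cutoff:
        return "medium"
    return "high"


salaries = [28000, 55000, 120000, 40000]
categories = [categorize_salary(s) for s in salaries]
print(categories)

# Once categorized, records can be selected with simple Boolean tests.
is_high = [c == "high" for c in categories]
print(is_high)
```

The price of this convenience is lost information: 40,000 and 89,999 both become "medium", so methods designed for continuous data can no longer tell them apart.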

36
Data Preparation
  • Problem of Missing Values
  • Missing values in massive data sets may or may
    not be a problem
  • Missing data may be irrelevant to desired result,
    e.g. cases with missing demographic data may not
    help if I am trying to create a selection
    mechanism for good customers based on
    demographics
  • Massive data sets, if acquired by
    instrumentation, may have few missing values
    anyway
  • Imputation has model assumptions
  • Suggest making a Missing Value Plot

37
Data Preparation
  • Missing Value Plot
  • A plot of variables by cases
  • Missing values colored red
  • Special case of color histogram with binary
    data
  • Color histogram also known as data image
  • This example is 67 dimensions by 1000 cases
  • This example is also fake
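
A missing value plot is a binary image of the data matrix. The sketch below builds the indicator matrix for a small fake data set and renders it as text, with variables as rows and cases as columns; a real plot would color the missing cells instead of printing `#`. The sizes, the missing rate, and the function names are assumptions made for this illustration.

```python
import random


def make_fake_data(n_cases, n_vars, missing_rate, seed=0):
    """Random cases-by-variables data with values dropped at missing_rate."""
    rng = random.Random(seed)
    return [[None if rng.random() < missing_rate else rng.gauss(0, 1)
             for _ in range(n_vars)] for _ in range(n_cases)]


def missing_indicator(data):
    """Binary matrix: 1 where a value is missing, 0 otherwise."""
    return [[1 if v is None else 0 for v in row] for row in data]


def render(indicator):
    """Variables as rows, cases as columns; '#' marks a missing value."""
    n_vars = len(indicator[0])
    return "\n".join(
        "".join("#" if case[j] else "." for case in indicator)
        for j in range(n_vars))


data = make_fake_data(n_cases=60, n_vars=8, missing_rate=0.1)
mask = missing_indicator(data)
print(render(mask))
```

In such a picture, a variable that is mostly missing shows up as a nearly solid row and a case that is mostly missing as a nearly solid column, which helps decide what to drop or impute.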

38
Data Preparation
  • Problem of Outliers
  • Outliers easy to detect in low dimensions
  • A high dimensional outlier may not show up in low
    dimensional projections
  • MVE or MCD algorithms are exponentially
    computationally complex
  • Fisher Info Matrix and Convex Hull Peeling more
    feasible but still too complex for Massive
    datasets
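
The point about projections can be illustrated even in two dimensions. In the sketch below, a point is unremarkable in each coordinate taken separately but lies far from a strongly correlated cloud. It uses the classical mean and covariance with a Mahalanobis distance, not the robust MVE or MCD estimators named on the slide, and the data are simulated.

```python
import random


def mean_and_cov(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(p, center, cov):
    """Squared Mahalanobis distance of p from center."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = p[0] - center[0], p[1] - center[1]
    # Quadratic form d^T S^{-1} d using the closed-form 2x2 inverse.
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


rng = random.Random(1)
cloud = []
for _ in range(500):
    x = rng.gauss(0, 1)
    cloud.append((x, x + rng.gauss(0, 0.1)))  # strongly correlated pair

center, cov = mean_and_cov(cloud)
typical = (1.0, 1.0)    # follows the correlation
suspect = (1.0, -1.0)   # each coordinate alone is about one standard deviation out

print(mahalanobis_sq(typical, center, cov))
print(mahalanobis_sq(suspect, center, cov))
```

Both points look ordinary on either axis, yet the second has a squared distance in the hundreds. In high dimensions the same effect can hide an outlier from every low dimensional view.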

39
Data Preparation
  • Database Sampling
  • Exhaustive search of databases may not be
    practically feasible because of their size
  • KDD systems must be able to assist in the
    selection of appropriate parts of the databases
    to be examined
  • For sampling to work, the data must satisfy
    certain conditions (not ordered, no systematic
    biases)
  • Sampling can be a very expensive operation,
    especially when the sample is taken from data
    stored in a DBMS. Sampling 5% of the database can
    be more expensive than a sequential full scan of
    the data.
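
One way to avoid paying for random access is to draw the sample during a single sequential scan. The sketch below uses reservoir sampling, a standard technique that is not mentioned on the slides and is offered here only as an illustration; the `range` object stands in for rows read in storage order.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Uniform random sample of k items from an iterable in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)         # inclusive on both ends
            if j < k:
                reservoir[j] = item       # replace with probability k / (i + 1)
    return reservoir


rows = range(100_000)                     # stand-in for a sequential table scan
sample = reservoir_sample(rows, k=5000, seed=42)   # 5% of the rows
print(len(sample), min(sample), max(sample))
```

Because every row has the same chance of ending up in the sample regardless of its position, this approach also sidesteps the problem of ordered data, although it cannot correct systematic biases in how the data were collected.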

40
Data Compression
  • Often data preparation involves data compression
  • Sampling
  • Quantization
  • Subject of my talk later in the conference. See
    that talk for more details on this subject.