Statistical Data Mining 1

1
Statistical Data Mining - 1
  • Edward J. Wegman

A Short Course for Interface '01
2
Outline of Lecture
  • Complexity
  • Data Mining: What is it?
  • Data Preparation

3
Complexity
  • Descriptor | Data Set Size in Bytes | Storage Mode
  • Tiny | 10^2 | Piece of Paper
  • Small | 10^4 | A Few Pieces of Paper
  • Medium | 10^6 | A Floppy Disk
  • Large | 10^8 | Hard Disk
  • Huge | 10^10 | Multiple Hard Disks, e.g. RAID Storage
  • Massive | 10^12 | Robotic Magnetic Tape Storage Silos
  • The Huber Taxonomy of Data Set Sizes
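
The taxonomy above can be turned into a small lookup. This is a minimal sketch: the slide lists only nominal sizes, so the rule of picking the descriptor nearest on a log10 scale, and the name `huber_descriptor`, are assumptions made for illustration.

```python
import math

# Nominal sizes in bytes, taken from the taxonomy on the slide.
HUBER_TAXONOMY = [
    ("Tiny", 10**2),
    ("Small", 10**4),
    ("Medium", 10**6),
    ("Large", 10**8),
    ("Huge", 10**10),
    ("Massive", 10**12),
]


def huber_descriptor(size_in_bytes):
    """Return the descriptor whose nominal size is nearest on a log10 scale.

    The nearest-on-log-scale rule is an assumption; the slide gives no
    boundaries between classes.
    """
    if size_in_bytes <= 0:
        raise ValueError("size_in_bytes must be positive")
    log_size = math.log10(size_in_bytes)
    nearest = min(HUBER_TAXONOMY, key=lambda item: abs(math.log10(item[1]) - log_size))
    return nearest[0]


print(huber_descriptor(1_440_000))   # a floppy-sized file
print(huber_descriptor(5 * 10**9))   # a few gigabytes
```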

4
Complexity
  • O(r), O(n^(1/2)): Plot a scatterplot
  • O(n): Calculate means, variances, kernel density
    estimates
  • O(n log(n)): Calculate fast Fourier transforms
  • O(nc): Calculate singular value decomposition of
    an r x c matrix; solve a multiple linear
    regression
  • O(n^2): Solve most clustering algorithms
  • Algorithmic Complexity
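
To see why these complexity classes matter for the data set sizes in the Huber taxonomy, the sketch below tabulates nominal running times with n taken as the data set size. The machine speed of 10^9 operations per second, the column count c = 10, the base-2 logarithm, and hidden constants of 1 are all assumptions chosen for illustration, not figures from the slides.

```python
import math


def operation_counts(n, c=10):
    """Nominal operation counts, taking every hidden constant as 1."""
    return {
        "n^(1/2)": math.sqrt(n),
        "n": float(n),
        "n log n": n * math.log2(n),
        "nc": float(n * c),
        "n^2": float(n) ** 2,
    }


OPS_PER_SECOND = 1e9  # assumed machine speed, not from the slides

for exponent in (2, 4, 6, 8, 10, 12):
    counts = operation_counts(10 ** exponent)
    row = ", ".join(
        f"{name}: {ops / OPS_PER_SECOND:.1e} s" for name, ops in counts.items()
    )
    print(f"n = 10^{exponent}: {row}")
```

Under these assumptions an O(n^2) algorithm on a Massive (10^12) data set needs about 10^15 seconds, while the O(n) and O(n log n) algorithms stay in the range of minutes to hours.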

5
Complexity
6
Complexity
7
Complexity
8
Complexity
9
Complexity
10
Complexity
11
Complexity
12
Complexity
13
Complexity
14
Complexity
  • Scenarios
  • Typical high resolution workstations:
    1280 x 1024 = 1.31 x 10^6 pixels
  • Realistic using Wegman, immersion, 4:5 aspect
    ratio:
    2333 x 1866 = 4.35 x 10^6 pixels
  • Very optimistic using 1 minute of arc, immersion,
    4:5 aspect ratio:
    8400 x 6720 = 5.65 x 10^7 pixels
  • Wildly optimistic using Maar(2), immersion, 4:5
    aspect ratio:
    17,284 x 13,828 = 2.39 x 10^8 pixels
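
The pixel counts in these scenarios are plain width-times-height products, which the sketch below recomputes. The dictionary labels are shorthand introduced here; the products agree with the slide's figures to within rounding in the last digit.

```python
# Display scenarios from the slide: label -> (width, height) in pixels.
SCENARIOS = {
    "Typical workstation": (1280, 1024),
    "Realistic (Wegman)": (2333, 1866),
    "Very optimistic (1 minute of arc)": (8400, 6720),
    "Wildly optimistic (Maar)": (17284, 13828),
}


def pixel_count(width, height):
    """Total number of addressable pixels on a width x height display."""
    return width * height


for label, (width, height) in SCENARIOS.items():
    total = pixel_count(width, height)
    print(f"{label}: {width} x {height} = {total:.2e} pixels, "
          f"height/width = {height / width:.2f}")
```

Even the wildly optimistic display offers on the order of 10^8 pixels, far fewer than the 10^12 bytes of a Massive data set.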

15
Massive Data Sets
  • One Terabyte Dataset
  • vs
  • One Million Megabyte Data Sets
  • Both difficult to analyze,
  • but for different reasons.

16
Massive Data Sets: Commonly Used Language
  • Data Mining (DM)
  • Knowledge Discovery in Databases (KDD)
  • Massive Data Sets (MD)
  • Data Analysis (DA)

17
Massive Data Sets
18
Data Mining of Massive Datasets
  • Data Mining is Exploratory Data Analysis with
    Little or No Human Interaction using
    Computationally Feasible Techniques, i.e., the
    Attempt to find Interesting Structure unknown a
    priori

19
Statistical Data Mining
  • Techniques
  • - Classification
  • - Clustering
  • - Neural Networks and Genetic Algorithms
  • - CART
  • - Nonparametric Regression
  • - Time Series: Trend and Spectral Estimation
  • - Density Estimation, Bumps and Ridges

20
Massive Data Sets
  • Major Issues
  • Complexity
  • Non-homogeneity
  • Examples
  • Huber's Air Traffic Control
  • Highway Maintenance
  • Ultrasonic NDE

21
Massive Data Sets
  • Air Traffic Control
  • 6 to 12 Radar stations, several hundred aircraft,
    64-byte record per radar per aircraft per antenna
    turn
  • On the order of a megabyte of data per minute
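
The data rate on this slide can be checked with a back-of-the-envelope calculation. The slide gives the station count and the 64-byte record size; the figure of 300 aircraft (for "several hundred") and the antenna rate of 5 turns per minute are assumptions chosen only to illustrate the arithmetic.

```python
def radar_bytes_per_minute(stations, aircraft, turns_per_minute, record_bytes=64):
    """One record per radar station per aircraft per antenna turn."""
    return stations * aircraft * record_bytes * turns_per_minute


ASSUMED_AIRCRAFT = 300        # assumption standing in for "several hundred"
ASSUMED_TURNS_PER_MINUTE = 5  # assumption: one antenna turn every 12 seconds

for stations in (6, 12):
    rate = radar_bytes_per_minute(stations, ASSUMED_AIRCRAFT, ASSUMED_TURNS_PER_MINUTE)
    print(f"{stations} stations: {rate / 1e6:.2f} megabytes per minute")
```

With these assumed values the rate comes out between roughly 0.6 and 1.2 megabytes per minute.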

22
Massive Data Sets
  • Highway Maintenance
  • Maintenance records and measurements of road
    quality for several decades
  • Records of uneven quality
  • Records missing

23
Massive Data Sets
  • NDE using Ultrasound
  • Inspection of cast iron projectiles
  • Time series of length 256 x 360 degrees x 550
    levels = 50,688,000 observations per projectile
  • Several thousand projectiles per day
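
The observation count per projectile is a product of the three scan dimensions, sketched below. The daily total uses 2,000 projectiles as an assumed stand-in for "several thousand".

```python
def observations_per_projectile(series_length=256, degrees=360, levels=550):
    """One time series of the given length per degree of rotation per level."""
    return series_length * degrees * levels


ASSUMED_PROJECTILES_PER_DAY = 2000  # assumption standing in for "several thousand"

per_projectile = observations_per_projectile()
per_day = per_projectile * ASSUMED_PROJECTILES_PER_DAY
print(f"{per_projectile:,} observations per projectile")
print(f"{per_day:.2e} observations per day at {ASSUMED_PROJECTILES_PER_DAY} projectiles")
```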

24
Massive Data Sets: A Distinction
  • Human Analysis of the Structure of Data and
    Pitfalls
  • vs
  • Human Analysis of the Data Itself
  • Limits of the human visual system (HVS) and
    computational complexity limit the latter
  • The former is the basis for design of the
    analysis engine

25
Massive Data Sets
  • Data Types
  • Experimental
  • Observational
  • Opportunistic
  • Data Types
  • Numerical
  • Categorical
  • Image

26
Data Preparation
27
Data Preparation
  • [Bar chart: Effort (%) by project phase. Phases:
    Objectives Determination, Data Preparation, Data
    Mining, Analysis and Assimilation. Vertical axis
    from 0 to 60.]

28
Data Preparation
  • Data Cleaning and Quality
  • Types of Data
  • Categorical versus Continuous Data
  • Problem of Missing Data
  • Imputation
  • Missing Data Plots
  • Problem of Outliers
  • Dimension Reduction, Quantization, Sampling

29
Data Preparation
  • Quality
  • Data may not have any statistically significant
    patterns or relationships
  • Results may be inconsistent with other data sets
  • Data often of uneven quality, e.g. made up by
    respondent
  • Opportunistically collected data may have biases
    or errors
  • Discovered patterns may be too specific or too
    general to be useful

30
Data Preparation
  • Noise - Incorrect Values
  • Faulty data collection instruments, e.g. sensors
  • Transmission errors, e.g. intermittent errors
    from satellite or Internet transmissions
  • Data entry problems
  • Technology limitations
  • Naming conventions misused

31
Data Preparation
  • Noise - Incorrect Classification
  • Human judgment
  • Time varying
  • Uncertainty/Probabilistic nature of data

32
Data Preparation
  • Redundant/Stale data
  • Variables have different names in different
    databases
  • Raw variable in one database is a derived
    variable in another
  • Irrelevant variables destroy speed (dimension
    reduction needed)
  • Changes in variable over time not reflected in
    database

33
Data Preparation
  • Data cleaning
  • Selecting an appropriate data set and/or
    sampling strategy
  • Transformations

34
Data Preparation
  • Data Cleaning
  • Duplicate removal (tool based)
  • Missing value imputation (manual, statistical)
  • Identify and remove data inconsistencies
  • Identify and refresh stale data
  • Create unique record (case) ID

35
Data Preparation
  • Categorical versus Continuous Data
  • Most statistical theory, many graphics tools
    developed for continuous data
  • Much, if not most, of the data in databases is
    categorical
  • The computer science view often converts
    continuous data into categorical data, e.g.
    salaries categorized as low, medium, high,
    because categories are better suited to Boolean
    operations
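
The salary example can be sketched as follows. The cutoffs of 40,000 and 90,000 are hypothetical values chosen for illustration; the slide names only the three categories.

```python
def categorize_salary(salary, low_cutoff=40_000, high_cutoff=90_000):
    """Map a continuous salary to 'low', 'medium', or 'high' (cutoffs are hypothetical)."""
    if salary < low_cutoff:
        return "low"
    if salary < high_cutoff:
        return "medium"
    return "high"


salaries = [25_000, 55_000, 120_000, 89_999]
categories = [categorize_salary(s) for s in salaries]
print(categories)

# Once categorized, selection becomes a simple Boolean test per record.
not_low = [s for s, c in zip(salaries, categories) if c != "low"]
print(not_low)
```

The conversion discards information: 55,000 and 89,999 become indistinguishable once both are labeled medium.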

36
Data Preparation
  • Problem of Missing Values
  • Missing values in massive data sets may or may
    not be a problem
  • Missing data may be irrelevant to the desired
    result, e.g. cases with missing demographic data
    may not help if I am trying to create a selection
    mechanism for good customers based on
    demographics
  • Massive data sets, if acquired by
    instrumentation, may have few missing values
    anyway
  • Imputation has model assumptions
  • Suggest making a Missing Value Plot

37
Data Preparation
  • Missing Value Plot
  • A plot of variables by cases
  • Missing values colored red
  • Special case of color histogram with binary
    data
  • Color histogram also known as data image
  • This example is 67 dimensions by 1000 cases
  • This example is also fake
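
The data behind such a plot is a binary variables-by-cases matrix. The sketch below builds that matrix and renders it as text, with "#" standing in for the red cells. It assumes cases are dictionaries in which a missing value is either None or an absent key; the function names are introduced here for illustration.

```python
def missing_value_matrix(cases, variables):
    """Rows are variables, columns are cases; 1 marks a missing value."""
    return [
        [1 if case.get(variable) is None else 0 for case in cases]
        for variable in variables
    ]


def render(matrix):
    """Text rendering: '#' for missing, '.' for present."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in matrix)


cases = [
    {"a": 1, "b": None},
    {"a": None, "b": 2},
    {"a": 3},  # "b" absent, treated as missing
]
matrix = missing_value_matrix(cases, ["a", "b"])
print(render(matrix))
```

Passing the same matrix to an image-plotting routine with a two-color map would give the graphical version described on the slide.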

38
Data Preparation
  • Problem of Outliers
  • Outliers easy to detect in low dimensions
  • A high dimensional outlier may not show up in low
    dimensional projections
  • MVE (minimum volume ellipsoid) and MCD (minimum
    covariance determinant) algorithms have
    exponential computational complexity
  • Fisher Information Matrix and Convex Hull Peeling
    are more feasible but still too complex for
    massive datasets
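
A two-dimensional toy example shows how an outlier can hide in lower-dimensional projections. The sketch below uses the ordinary Mahalanobis distance, not the MVE or MCD estimators named above, and the data are synthetic: two variables that track each other closely, plus a candidate point that is unremarkable on each axis separately but violates their joint relationship.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance; covariance(xs, xs) is the sample variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def z_score(value, values):
    return (value - mean(values)) / math.sqrt(covariance(values, values))


def mahalanobis_2d(point, xs, ys):
    """Mahalanobis distance of a 2-D point from the sample (xs, ys)."""
    dx, dy = point[0] - mean(xs), point[1] - mean(ys)
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    det = sxx * syy - sxy * sxy
    return math.sqrt((dx * dx * syy - 2 * dx * dy * sxy + dy * dy * sxx) / det)


# Synthetic data: y follows x with a small alternating offset.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x + 0.1 * (-1) ** i for i, x in enumerate(xs)]
candidate = (2.0, -2.0)

print("z-score on x:", round(z_score(candidate[0], xs), 2))
print("z-score on y:", round(z_score(candidate[1], ys), 2))
print("Mahalanobis distance:", round(mahalanobis_2d(candidate, xs, ys), 1))
```

Both one-dimensional z-scores stay below 1 in absolute value, while the joint distance is large, so neither coordinate projection reveals the outlier.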

39
Data Preparation
  • Database Sampling
  • Exhaustive search of databases may not be
    practically feasible because of their size
  • The KDD systems must be able to assist in the
    selection of appropriate parts of the databases
    to be examined
  • For sampling to work, the data must satisfy
    certain conditions (not ordered, no systematic
    biases)
  • Sampling can be a very expensive operation,
    especially when the sample is taken from data
    stored in a DBMS. Sampling 5% of the database can
    be more expensive than a sequential full scan of
    the data.
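
One way to reconcile sampling with the cost of random access is to draw the sample during a single sequential scan. The sketch below uses reservoir sampling (Algorithm R), a technique that the slide does not mention and that is offered here only as an illustration. It yields a uniform random sample of k rows without knowing the table size in advance.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Uniform random sample of up to k rows from one sequential pass over `rows`."""
    rng = random.Random(seed)
    reservoir = []
    for index, row in enumerate(rows):
        if index < k:
            reservoir.append(row)      # fill the reservoir first
        else:
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                reservoir[j] = row     # keep this row with probability k / (index + 1)
    return reservoir


# Any iterable works, including a database cursor that yields rows in storage order.
sample = reservoir_sample(range(1_000_000), k=5, seed=0)
print(sample)
```

Because every row has the same chance of ending up in the sample, the ordering of the stored data does not bias the result, although systematic biases in how the data were collected remain.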

40
Data Compression
  • Often data preparation involves data compression
  • Sampling
  • Quantization
  • Subject of my talk later in the conference. See
    that talk for more details on this subject.