Dimension Reduction and Sampling in the Scientific Data Management Center (SDM-ISIC)

Chandrika Kamath and Imola K. Fodor
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
SciDAC SDM-ISIC Kickoff Meeting, July 10, 2001

UCRL-PRES-144537. This work was performed under the auspices of the U.S. Department of Energy by University of California, Lawrence Livermore National Laboratory, under contract no. W-7405-Eng-48.
2. We are borrowing ideas from data mining to improve the management of data

- Scientific data are often massive and high dimensional
- Need efficient techniques for storage and access
  - efficient indexing through vertical partitioning (LBNL task 2c.i)
  - clustering (ORNL task 3c.i)
- Our goal: make these tasks more tractable by reducing the number of dimensions

⇒ We want to identify the most important attributes of a data item so that further processing can be simplified without compromising the quality of the final results.
3. MIT's Technology Review (Jan '01): Data mining is a top ten emerging technology

- Data mining: the semi-automatic discovery of patterns, associations, anomalies, and statistically significant structures in data
- Pattern recognition: the discovery and characterization of patterns
- Pattern: an ordering with an underlying structure
- Feature: an extractable measurement or attribute

Example — Pattern: radio galaxy with a bent-double morphology. Features: number of blobs, maximum intensity in a blob, spatial relationship between blobs (distances and angles).
4. Big picture view of data mining

Pipeline: Raw Data → Object Recognition and Feature Extraction → Features → Dimension Reduction → Features → Pattern Recognition → Information
5. Classifying radio-emitting galaxies with a bent-double morphology in the FIRST survey

- Faint Images of the Radio Sky at Twenty centimeters
- Uses the NRAO Very Large Array (VLA), B configuration
- 10,000 square degree survey, ~90 radio galaxies per square degree
- 1.8″ pixels, 5″ resolution, rms 0.15 mJy
- Image maps and catalog available
6. FIRST data set: detecting bent-doubles in 250 GB of image data and 78 MB of catalog data

- 32K image maps (1150 × 1550 pixels), 7.1 MB each
- Catalog: 720K entries
- (Figure: a radio galaxy spanning about 64 pixels of an image map, with its corresponding catalog entry.)
7. Our approach for classifying radio galaxies using features from the catalog

- Consider a region of interest (ROI)
- Group catalog entries within the ROI
- Separate sources based on the number of catalog entries:
  - 1-entry sources: unlikely to be bent-doubles
  - > 3-entry sources: all interesting
  - classify 2- and 3-entry sources separately
    - a small training set becomes smaller (313 → 118 + 195)
- Focus on the 3-entry galaxies:
  - extract features (103 features)
  - create a decision tree using the training set
  - use the tree to classify the unlabeled galaxies
8. We have used simple feature selection techniques to reduce the number of features

- Input from domain experts
- EDA techniques: parallel plots and box plots
- Wrapper approach
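The wrapper approach above evaluates candidate feature subsets by the error of the downstream classifier itself. A minimal sketch of the idea, using greedy forward selection: the function names are ours, and a simple nearest-centroid classifier stands in for the decision tree the slides actually use.

```python
import numpy as np

def nearest_centroid_error(X, y):
    """Training error of a nearest-centroid classifier; a stand-in
    for the decision tree used in the slides."""
    classes = np.unique(y)
    cents = np.array([X[y == c].mean(axis=0) for c in classes])
    dist = ((X[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    pred = classes[dist.argmin(axis=1)]
    return (pred != y).mean()

def greedy_wrapper(X, y, n_select):
    """Greedy forward selection: repeatedly add the feature that most
    reduces the wrapped classifier's error."""
    chosen = []
    while len(chosen) < n_select:
        best = min((f for f in range(X.shape[1]) if f not in chosen),
                   key=lambda f: nearest_centroid_error(X[:, chosen + [f]], y))
        chosen.append(best)
    return chosen
```

In practice the error would be estimated by cross-validation rather than on the training set, but the search loop is the same.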
9. There are also more complex techniques for dimension reduction

- Principal component analysis (PCA)
  - transforms the features to be mutually uncorrelated
  - focuses on directions that maximize the variance
- Given N data items in d dimensions:
  - find the d-dimensional mean vector
  - obtain the d × d covariance matrix
  - obtain the d eigenvalues and eigenvectors of the covariance matrix
  - keep the k largest eigenvectors (k << d)
  - project the (original data − mean) onto the space spanned by these vectors

⇒ The eigenvectors, or principal components (PCs), are mutually orthogonal, and the original data are linear combinations of these PCs.
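The steps above can be sketched directly in NumPy (an illustrative sketch, not the project's implementation):

```python
import numpy as np

def pca_reduce(X, k):
    """Project an N x d data matrix X onto its k leading principal
    components, following the steps on the slide."""
    mean = X.mean(axis=0)                  # d-dimensional mean vector
    Xc = X - mean                          # (original data - mean)
    cov = np.cov(Xc, rowvar=False)         # d x d covariance matrix
    evals, evecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]        # sort descending by variance
    top = evecs[:, order[:k]]              # k leading eigenvectors (k << d)
    return Xc @ top                        # N x k projected data

# toy usage: 5 points in 3-D, reduced to 2-D
X = np.array([[1., 2., 3.], [2., 4., 6.1], [3., 6., 8.9],
              [4., 8., 12.2], [5., 10., 15.]])
Z = pca_reduce(X, 2)
```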
10. We applied PCA to the problem of bent-double classification

- The first 20 PCs explained about 90% of the variance
- Eliminating unimportant variables:
  - eliminate the variable with the largest coefficient in the eigenvector corresponding to the smallest eigenvalue
  - repeat with the eigenvector for the next smallest eigenvalue
  - continue until 20 variables are left

⇒ Using only the 31 features found through EDA and PCA lowers the decision tree error from 11.1% to 9.5%.
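The elimination procedure above can be sketched as follows: eigendecompose the covariance once, then walk the eigenvectors from the smallest eigenvalue upward, each time discarding the not-yet-dropped variable with the largest absolute coefficient. This is an illustrative sketch (function and variable names are ours), not the authors' code.

```python
import numpy as np

def eliminate_variables(X, n_keep):
    """Return indices of the variables kept after PCA-based elimination,
    as described on the slide."""
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    remaining = set(range(X.shape[1]))
    for j in range(X.shape[1]):            # j-th smallest eigenvalue
        if len(remaining) <= n_keep:
            break
        coeffs = np.abs(evecs[:, j])       # e-vector for this e-value
        drop = max(remaining, key=lambda i: coeffs[i])
        remaining.discard(drop)            # largest coefficient goes
    return sorted(remaining)
```

The intuition: an eigenvector with a tiny eigenvalue describes a near-constant combination of variables, so the variable dominating it is nearly redundant given the others.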
11. PCA does not provide a perfect solution to the problem of dimension reduction

- The linear combination makes interpretation difficult
  - use the PCs to find important variables
- May not produce separation of clusters
  - need to preserve interesting properties of the data

⇒ We want to consider non-linear and non-orthogonal projections.
12. Our current plan for task 3b.i

- Work with a climate data set from Ben Santer (LLNL)
  - understand issues from the climate viewpoint
  - identify features
  - apply PCA
  - investigate other techniques (projection pursuit, independent component analysis, non-linear PCA)
- Implementation issues
  - incremental implementation for a growing dataset
  - sampling to reduce the number of items
- Collaboration with ORNL and LBNL
  - feed the reduced-dimension dataset to task 3c.i (ORNL)
  - understand the HyCeltyc algorithm (LBNL)
  - STAR HEP data (LBNL)
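The incremental implementation mentioned under task 3b.i could, for example, maintain the running mean and covariance as new items arrive, so the inputs to PCA need not be recomputed from scratch for a growing dataset. A Welford-style sketch (the class name is ours, not the project's actual code):

```python
import numpy as np

class RunningCovariance:
    """Maintain the mean and covariance of a growing d-dimensional
    dataset, updated one item at a time."""
    def __init__(self, d):
        self.n = 0
        self.mean = np.zeros(d)
        self._M2 = np.zeros((d, d))   # running sum of residual outer products

    def update(self, x):
        """Fold one new data item into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._M2 += np.outer(delta, x - self.mean)

    def covariance(self):
        """Unbiased covariance estimate; requires n >= 2."""
        return self._M2 / (self.n - 1)
```

Sampling the items (the other implementation issue listed above) attacks N instead of d; the two can be combined.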