Imola K. Fodor and Chandrika Kamath, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory

1
Imola K. Fodor and Chandrika Kamath
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
IPAM Workshop, January 2002
Mining the FIRST Astronomical Survey
2
Faint Images of the Radio Sky at
Twenty-Centimeters (FIRST)

On-going sky survey, started in 1993
When completed, will cover more than 10,000 deg²
to a flux density limit of 1.0 mJy (milli-Jansky)
Current coverage is about 8,000 deg²:
more than 32,000 two-million-pixel images
There are about 90 radio sources/deg²
Data available at http://sundog.stsci.edu

NRAO Very Large Array (VLA)
3
One goal of FIRST is to identify radio galaxies
with a bent-double morphology

A bent-double galaxy is
Problem: there is no definition of bent-double
Rough characteristic: there is a radio-emitting
core, along with a number of (not necessarily
two!) side components that are bent around the
core
Astronomers search manually for bent-doubles

Bent-doubles

Non bent-doubles
4
Sapphire: use data mining to enhance the visual
search for bent-doubles

Use galaxies classified by astronomers to model
the binary response variable Y
Find features X and model f(X) with desired
accuracy
Aim: 10% misclassification error, as manual
classification is not more accurate

Denoising -> Feature extraction -> Dimension reduction
-> Classification
5
The FIRST catalog is based on fitting 2D
elliptical Gaussians to denoised images
[Figure labels: image map, 1150 pixels by 1550 pixels; 32K image maps, 7.1 MB each; catalog, 720K entries; 64 pixels; catalog entry (CE); radio source (RS)]
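As a rough illustration of the model behind each catalog entry, a 2D elliptical Gaussian can be evaluated as sketched here. The parameter names (peak, x0, y0, sigma_major, sigma_minor, theta) are assumptions made for this sketch and are not the FIRST catalog's column names; the fitting step itself is not shown.

```python
import math

def elliptical_gaussian(x, y, peak, x0, y0, sigma_major, sigma_minor, theta):
    """Evaluate a rotated 2D elliptical Gaussian at pixel position (x, y).

    theta is the angle (radians) of the major axis, measured from the x-axis.
    All parameter names are hypothetical, chosen for this illustration.
    """
    dx, dy = x - x0, y - y0
    # Rotate the offsets into the ellipse's own major/minor axes.
    u = dx * math.cos(theta) + dy * math.sin(theta)
    v = -dx * math.sin(theta) + dy * math.cos(theta)
    return peak * math.exp(-0.5 * ((u / sigma_major) ** 2 + (v / sigma_minor) ** 2))

# One sigma away from the center along the major axis,
# the intensity drops to peak * exp(-1/2).
value = elliptical_gaussian(5.0, 0.0, 1.0, 0.0, 0.0, 5.0, 2.0, 0.0)
```

A fit would adjust these six parameters per component so that the sum of Gaussians matches the denoised image.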
6
A first pre-processing step is to identify
potential features to discriminate bents

For the FIRST data, we extracted various features
based on radio intensities, angles, distances, ...
For galaxies with 3 entries:
a total of 103 features
three sets of single features, three pairs of
double features, and the triple features
possible redundancies
Reduce dimension using:
domain knowledge
EDA
PCA
GLM step-wise model selection

7
Triple features for three catalog entries
8
Using exploratory data analysis (EDA), we reduced
the number of features to 25

Use EDA techniques such as
box-plots
multivariate plots
parallel-coordinate plots
correlation matrix
to
explore the data
find unusual observations
eliminate correlations among the features
Call these EDA features
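One simple way to act on a correlation matrix is to drop one feature from every highly correlated pair. The sketch below assumes a greedy rule and a threshold of 0.9; neither is stated in the slides, and the feature names are placeholders.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Greedily keep features whose |correlation| with every kept feature
    is at most `threshold` (rule and threshold are assumptions)."""
    corr = np.corrcoef(X, rowvar=False)  # p x p correlation matrix
    kept = []
    for j in range(X.shape[1]):
        if all(abs(corr[j, i]) <= threshold for i in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Synthetic data: f2 is almost exactly -f1 (large negative correlation).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, -a + 0.01 * rng.normal(size=200), b])
selected = drop_correlated(X, ["f1", "f2", "f3"])
print(selected)  # ['f1', 'f3']
```

In practice the choice of which member of a pair to keep would also draw on domain knowledge and the plots listed above.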

9
Example parallel-coordinate plot: nine variables
split by bentness category

[Figure: parallel-coordinate plots for the bent and non-bent classes (3/2 sky regions for bent/non-bent). Annotations: x marks unusual observations; large negative correlation.]
10
Principal component analysis (PCA) finds linear
combinations of variables

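Suppose we have p features $x = (x_1, \ldots, x_p)^T$ with covariance matrix $\Sigma$,
and we want a linear combination $a^T x$, $\|a\| = 1$, with max.
variance
By the spectral decomposition theorem, $\Sigma = U \Lambda U^T$,
with $U = (u_1, \ldots, u_p)$ orthogonal and $\Lambda = \mathrm{diag}(\lambda_1 \ge \cdots \ge \lambda_p)$
The first PC, $y_1 = u_1^T x$, has maximal
variance, and $\mathrm{Var}(y_1) = \lambda_1$
The total variance is preserved, $\sum_i \mathrm{Var}(y_i) = \sum_i \lambda_i = \mathrm{tr}(\Sigma) = \sum_i \mathrm{Var}(x_i)$
Dimension reduction: use first k PCs as new
features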
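The properties on this slide can be checked numerically. The following is a minimal sketch on synthetic data, assuming the sample covariance matrix is used; the data and the choice k = 2 are illustrative only.

```python
import numpy as np

def pca(X):
    """Return eigenvalues (descending) and eigenvectors (as columns)
    of the sample covariance matrix of X (rows = observations)."""
    S = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])
lam, U = pca(X)

# Use the first k = 2 PCs as new features.
scores = (X - X.mean(axis=0)) @ U[:, :2]

# Total variance is preserved: sum of eigenvalues equals trace of covariance.
total_preserved = np.isclose(lam.sum(), np.trace(np.cov(X, rowvar=False)))
```

The fraction of variance explained by the first k PCs is `lam[:k].sum() / lam.sum()`.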
11
We used PCA differently to reduce the number of
original features to 20

The first 20 PCs explain 90% of the variance
PCs are hard to interpret: instead of using 20
PCs, keep 20 of the original variables
Following Multivariate Analysis (Mardia, Kent, Bibby):
consider the last PC, with the smallest variance
find its largest (in absolute value) coefficient,
and discard the corresponding original variable
repeat the procedure with the second-to-last PC,
and iterate until only 20 variables remain
Call these PCA features
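The discarding procedure can be sketched as follows. Two details are assumptions of this sketch, since the slides do not specify them: the PCA is computed once on standardized variables, and a variable that was already discarded is skipped when a later PC is examined.

```python
import numpy as np

def discard_by_last_pcs(X, names, n_keep):
    """Discard variables using the PCs with the smallest variance,
    until n_keep original variables remain."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize (assumption)
    # eigh returns eigenvalues in ascending order, so column 0 is the last PC.
    _, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    remaining = list(range(X.shape[1]))
    for pc in range(X.shape[1] - n_keep):
        coeffs = np.abs(eigvecs[remaining, pc])
        drop = remaining[int(np.argmax(coeffs))]  # largest |coefficient|
        remaining.remove(drop)
    return [names[j] for j in remaining]

# Synthetic example: "a_copy" nearly duplicates "a", so one of the two is redundant.
rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = rng.normal(size=300)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=300)])
kept = discard_by_last_pcs(X, ["a", "b", "a_copy"], n_keep=2)
```

The last PC here is dominated by the difference between the two near-duplicate variables, so one of them is discarded and the independent variable is kept.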

12
We also used step-wise model selection to reduce
the number of variables

Binary response Y: bent, non-bent
Explanatory variables: features
Logistic regression, step-wise model selection
with the AIC as a measure of goodness (minimize the
negative log-likelihood, with a penalty term for large
models)
Cannot use all 103 features because of
correlations
We identified the features selected by EDA or PCA
stepwise model selection -> GLM 2 features (25)
We identified the features selected by EDA and
PCA
stepwise model selection -> GLM 3 features (10)
stepwise model selection, including second-order
interactions -> GLM 4 features (9, 5
interactions)
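For reference, the usual definition is AIC = -2 log L + 2k, where L is the maximized likelihood and k the number of fitted parameters. The sketch below compares candidate models by AIC; the log-likelihoods and parameter counts are made-up numbers for illustration, not results from the FIRST data.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: smaller is better."""
    return -2.0 * log_likelihood + 2.0 * n_params

# Hypothetical candidates: name -> (maximized log-likelihood, number of parameters).
candidates = {
    "small": (-120.0, 3),
    "medium": (-110.0, 6),
    "large": (-109.5, 12),
}
scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # medium
```

The large model fits slightly better than the medium one, but its extra parameters cost more than the gain in log-likelihood; a step-wise search applies this comparison each time a term is added or dropped.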

13
Pattern recognition uses the features from
pre-processing to classify the data
[Flowchart: an iterative and interactive classification process]
Extract Features
Training data
Create Classifier (Decision Tree, GLM)
Check for Accuracy
Extract Features for Unclassified Data
Apply Classifier to Unclassified Data
Show Results and Obtain Score
Update Training Data
14
We use decision trees to classify the radio
sources into bents and non-bents

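Use information gain to split
$S$: set of examples at a node
$k$: number of classes
$S$ split into two, $S_1$ and $S_2$
$n_i(S)$: number of class $i$ examples in $S$
$\mathrm{Info}(S) = -\sum_{i=1}^{k} \frac{n_i(S)}{|S|} \log_2 \frac{n_i(S)}{|S|}$
$\mathrm{Gain} = \mathrm{Info}(S) - \sum_{j=1}^{2} \frac{|S_j|}{|S|} \mathrm{Info}(S_j)$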
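A minimal sketch of information gain for a binary split, assuming the standard entropy-based definition with base-2 logarithms. The labels are toy values.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of the class labels at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Entropy reduction from splitting `labels` into `left` and `right`."""
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

node = ["bent"] * 4 + ["non-bent"] * 4
perfect = information_gain(node, ["bent"] * 4, ["non-bent"] * 4)
useless = information_gain(node,
                           ["bent", "bent", "non-bent", "non-bent"],
                           ["bent", "bent", "non-bent", "non-bent"])
```

A split that separates the classes completely gains the full one bit of entropy at this node, while a split that leaves the class proportions unchanged gains nothing. A tree builder would evaluate candidate splits and choose the one with the largest gain.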

15
Decision tree created with all the features: Tree 1

Resubstitution error, train/test (90%) set: 2.8%
Cross-validation error, train/validate (10%) set: 5.3%

16
Decision tree created with the EDA and PCA
features: Tree 2

Resubstitution error: 1.7%
Cross-validation error: 5.3%

17
Decision tree created with the GLM 3 features:
Tree 3

Resubstitution error: 2.8%
Cross-validation error: 0%
Using fewer, well-selected variables results in
smaller and more accurate trees

18
We also used generalized linear models (GLMs) to
classify the galaxies

Linear models explain response variables in terms
of linear combinations of explanatory variables: $E(Y) = \mu = X\beta$
Least-squares estimate $\hat\beta$ solves $X^T X \hat\beta = X^T y$
No restrictions on the range of fitted values $\hat\mu = X\hat\beta$
GLMs allow such restrictions by modeling $g(\mu) = X\beta$,
where $g(\cdot)$ is a monotone increasing link
function

19
Logistic regression is a special GLM suitable for
modeling binary responses

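$Y \in \{0, 1\}$
Logit link and variance functions: $g(\mu) = \log\frac{\mu}{1-\mu}$, $V(\mu) = \mu(1-\mu)$
Likelihood non-linear in parameters, no
closed-form solution: iteratively reweighted
least squares to find $\hat\beta$
Given $\hat\beta$, $\hat\mu = \frac{\exp(x^T\hat\beta)}{1+\exp(x^T\hat\beta)}$ and $\hat{Y} = I(\hat\mu > c)$,
where $I(a)$ is 0, 1 according to $a$ = False,
$a$ = True, and the fraction $c$ is generally taken
to be 0.5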
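Iteratively reweighted least squares for logistic regression can be sketched as below. This is a bare-bones illustration on synthetic data, assuming the data are not perfectly separable; the function names, iteration limit, and tolerance are choices made for this sketch.

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression coefficients (intercept first) by IRLS."""
    X1 = np.column_stack([np.ones(len(X)), X])   # add intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        eta = X1 @ beta                          # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))          # inverse logit link
        w = mu * (1.0 - mu)                      # variance function V(mu)
        z = eta + (y - mu) / w                   # working response
        # Weighted least-squares step: (X' W X) beta = X' W z
        beta_new = np.linalg.solve(X1.T @ (w[:, None] * X1), X1.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

def predict(X, beta, c=0.5):
    """Label 1 when the fitted probability exceeds the cutoff c."""
    X1 = np.column_stack([np.ones(len(X)), X])
    mu = 1.0 / (1.0 + np.exp(-(X1 @ beta)))
    return (mu > c).astype(int)

# Synthetic data from a known logistic model (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(400, 1))
p = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x[:, 0])))
y = rng.binomial(1, p).astype(float)
beta_hat = fit_logistic_irls(x, y)
labels = predict(x, beta_hat)
```

At convergence the score equations $X^T (y - \hat\mu) = 0$ hold, which gives a quick check on the fit.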

20
GLM created with the GLM 2 features
21
GLM created with the GLM 3 features
22
GLM created with the GLM 4 features
23
Misclassification errors of best models are below
the desired 10% in the training set

Careful selection of variables reduces error
Trees are less sensitive to input features than
GLMs
GLM 4 has lowest misclassification errors

Misclassification errors based on 10 ten-fold
cross-validations in the training set
24
Our methods identified the interesting part of
the FIRST dataset

15,059 three-entry radio sources in the 2000
catalog
2,577 labeled as bent by all six methods
Astronomers can start by exploring the smaller
set
Visually explore random samples to assess the
percentage of false positives and missed bents
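Selecting the sources that every method labels as bent amounts to an intersection of the predictions. The sketch below uses toy labels and hypothetical method names; only the two counts in the last line come from this slide.

```python
def unanimous_bent(predictions):
    """Return the indices labeled bent (1) by every method.

    predictions: dict mapping a method name to a list of 0/1 labels,
    all lists having the same length.
    """
    n = len(next(iter(predictions.values())))
    return [i for i in range(n) if all(p[i] == 1 for p in predictions.values())]

# Toy labels for four sources from three hypothetical methods.
toy = {
    "tree_1": [1, 0, 1, 1],
    "glm_2": [1, 1, 0, 1],
    "glm_4": [1, 0, 1, 1],
}
agreed = unanimous_bent(toy)
print(agreed)  # [0, 3]

# Fraction of the three-entry sources labeled bent by all six methods.
fraction = 2577 / 15059   # about 17.1%
```

Requiring agreement from all methods shrinks the set to inspect, at the cost of possibly missing bents that only some methods detect, which is why the visual check of random samples is still needed.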

Classification results for the entire 2000 catalog
25
Example classifications for previously unlabeled
galaxies are encouraging

The labels commonly assigned by the six methods
are correct in the examples below

Bent
Non-bent
26
Summary

Described how data mining can help identify radio
galaxies with bent-double morphology
Illustrated specific data mining steps:
data pre-processing is crucial
In our experience, data mining is semi-automatic:
interaction and feedback required at many stages
domain knowledge is essential
Multi-disciplinary collaboration is challenging,
but rewarding:
astronomy - computer science - statistics
There is always room for improvement
alternative techniques
your feedback welcome!

27
The Sapphire team: supporting a
multi-disciplinary endeavor

Chandrika Kamath (Project Lead)
Erick Cantú-Paz
Imola K. Fodor
Nu A. Tang
Thanks to the FIRST scientists Robert Becker,
Michael Gregg, David Helfand, Sally
Laurent-Muehleisen, and Rick White

www.llnl.gov/casc/sapphire
UCRL-JC-145672. This work was performed under the
auspices of the U.S. Department of Energy
by University of California Lawrence Livermore
National Laboratory under contract W-7405-Eng-48.
