1
Imola K. Fodor and Chandrika Kamath
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
IPAM Workshop, January 2002
Mining the FIRST Astronomical Survey
2
Faint Images of the Radio Sky at
Twenty-Centimeters (FIRST)
  • Ongoing sky survey, started in 1993
  • When completed, will cover more than 10,000 deg²
    to a flux density limit of 1.0 mJy (milli-Jansky)
  • Current coverage is about 8,000 deg²
  • More than 32,000 two-million-pixel images
  • There are about 90 radio sources/deg²
  • Data available at http://sundog.stsci.edu


NRAO Very Large Array (VLA)
3
One goal of FIRST is to identify radio galaxies
with a bent-double morphology
  • A bent-double galaxy is
  • Problem: there is no definition of bent-double
  • Rough characteristic: there is a radio-emitting
    core, along with a number of (not necessarily
    two!) side components that are bent around the
    core
  • Astronomers search manually for bent-doubles

Bent-doubles

Non bent-doubles
4
Sapphire: use data mining to enhance the visual
search for bent-doubles
  • Use galaxies classified by astronomers to model
    the binary response variable Y
  • Find features X and model f(X) with desired
    accuracy
  • Aim: 10% misclassification error, as manual
    classification is not more accurate

Denoising → Feature extraction → Dimension reduction →
Classification
5
The FIRST catalog is based on fitting 2D
elliptical Gaussians to denoised images
  • Image map: 1150 x 1550 pixels; 32K image maps, 7.1MB each
  • Catalog: 720K entries
  • Figure labels: 64 pixels; catalog entry (CE); radio source (RS)
6
A first pre-processing step is to identify
potential features to discriminate bents
  • For the FIRST data, we extracted various features
    based on radio intensities, angles, distances, ...
  • For galaxies with 3 entries
      • a total of 103 features
      • three sets of single features, three pairs of
        double features, and the triple features
      • possible redundancies
  • Reduce dimension using
      • domain knowledge
      • EDA
      • PCA
      • GLM step-wise model selection

7
Triple features for three catalog entries
8
Using exploratory data analysis (EDA), we reduced
the number of features to 25
  • Use EDA techniques such as
      • box-plots
      • multivariate plots
      • parallel-coordinate plots
      • correlation matrix
  • to
      • explore the data
      • find unusual observations
      • eliminate correlations among the features
  • Call these the EDA features
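
The slide does not say how correlated features were eliminated, so the following is only a minimal sketch of one common approach: greedily keep a feature when its absolute Pearson correlation with every feature already kept stays below a cutoff. The feature names, the toy values, and the 0.9 cutoff are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |correlation| with every kept feature is below threshold.

    columns maps a feature name to its list of values; insertion order
    decides which member of a correlated pair survives.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy features, not taken from the FIRST catalog.
features = {
    "peak_flux": [1.0, 2.0, 3.0, 4.0, 5.0],
    "integrated_flux": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly proportional to peak_flux
    "angle": [30.0, 10.0, 50.0, 20.0, 40.0],
}
selected = drop_correlated(features)
print(selected)
```

In practice the choice of which correlated feature to keep would rest on domain knowledge rather than on column order.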

9
Example parallel-coordinate plot: nine variables
split by bentness category
[Figure: parallel-coordinate plots for the Bent and Non-bent classes. Annotations: 3/2 sky regions for bent/non-bent; x marks unusual observations; large negative correlation]
10
Principal component analysis (PCA) finds linear
combinations of variables
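  • Suppose we have p features, $x = (x_1, \dots, x_p)^T$, with covariance matrix $\Sigma$
  • and we want a linear combination $a^T x$, with $a^T a = 1$, with max.
    variance
  • By the spectral decomposition theorem, $\Sigma = \Gamma \Lambda \Gamma^T$, with
    $\Gamma = (\gamma_1, \dots, \gamma_p)$ orthogonal and
    $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$, $\lambda_1 \ge \dots \ge \lambda_p \ge 0$
  • the first PC, $y_1 = \gamma_1^T x$, has maximal
    variance, and $\mathrm{Var}(y_1) = \lambda_1$
  • The total variance is preserved, $\sum_{i=1}^{p} \mathrm{Var}(y_i) = \sum_{i=1}^{p} \lambda_i = \sum_{i=1}^{p} \mathrm{Var}(x_i)$
  • Dimension reduction: use first k PCs as new
    features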
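
As a minimal sketch of the construction on this slide, assuming NumPy and synthetic data, the PCs can be obtained from an eigen-decomposition of the sample covariance matrix. The function name and the data are illustrative and not taken from the Sapphire software.

```python
import numpy as np

def principal_components(X):
    """Return (variances, directions, scores) of the PCs of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)        # sample covariance matrix (ddof=1)
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric solver, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder so the largest variance comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs                 # column j holds the j-th PC
    return eigvals, eigvecs, scores

# Synthetic correlated data, used only to exercise the function.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

variances, directions, scores = principal_components(X)
total_original = np.var(X, axis=0, ddof=1).sum()

k = 2
reduced = scores[:, :k]                         # first k PCs as new features
print(variances, total_original, reduced.shape)
```

The columns of `directions` are orthonormal, and the PC variances add up to the summed variances of the original features, which is the "total variance is preserved" statement.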
11
We used PCA differently to reduce the number of
original features to 20
  • The first 20 PCs explain 90% of the variance
  • PCs are hard to interpret: instead of using 20
    PCs, keep 20 of the original variables
  • Procedure from Multivariate Analysis (Mardia, Kent, Bibby):
      • consider the last PC, with the smallest variance
      • find its largest (in abs. value) coefficient,
        and discard the corresponding original variable
      • repeat the procedure w/ the second-to-last PC,
        and iterate until only 20 variables remain
  • Call these the PCA features
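
The discarding procedure above can be sketched as follows, assuming NumPy. Two details are assumptions of this sketch rather than statements from the slides: the PCA is run on the correlation matrix, and a variable that was already discarded is skipped when a later PC points at it again.

```python
import numpy as np

def select_variables_by_pca(X, n_keep):
    """Return sorted column indices of the n_keep variables that survive the discarding."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    if not 1 <= n_keep <= n_features:
        raise ValueError("n_keep must be between 1 and the number of features")
    corr = np.corrcoef(X, rowvar=False)          # assumption: correlation-matrix PCA
    _, eigvecs = np.linalg.eigh(corr)            # columns ordered by ascending variance
    kept = list(range(n_features))
    for j in range(n_features - n_keep):         # last PC first, then second-to-last, ...
        loadings = np.abs(eigvecs[:, j])
        # Among variables still kept, discard the one with the largest |coefficient|.
        discard = max(kept, key=lambda i: loadings[i])
        kept.remove(discard)
    return sorted(kept)

# Synthetic example: column 3 is almost a copy of column 0.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
X = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=500)])

kept = select_variables_by_pca(X, n_keep=3)
print(kept)
```

Here the smallest-variance PC is essentially the difference between the two near-duplicate columns, so one of them is the variable that gets discarded.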

12
We also used step-wise model selection to reduce
the number of variables
  • Binary response Y: bent, non-bent
  • Explanatory variables: features
  • Logistic regression, step-wise model selection
    with the AIC as a measure of goodness (minimize
    -log-likelihood, with a penalty term for large
    models)
  • Cannot use all 103 features because of
    correlations
  • We identified the features selected by EDA or PCA
      • stepwise model selection -> GLM 2 features (25)
  • We identified the features selected by EDA and
    PCA
      • stepwise model selection -> GLM 3 features (10)
      • stepwise model selection, including second-order
        interactions -> GLM 4 features (9, 5
        interactions)
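
To make the AIC criterion concrete, here is a small sketch using the standard definition AIC = -2 log-likelihood + 2 (number of parameters). The labels, the fitted probabilities, and the model names are made-up toy values used only to show the comparison; they are not results from the FIRST data.

```python
import math

def bernoulli_log_likelihood(labels, probs):
    """Log-likelihood of 0/1 labels under fitted probabilities P(Y = 1)."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(labels, probs))

def aic(log_likelihood, n_params):
    """Akaike information criterion: smaller is better."""
    return -2.0 * log_likelihood + 2.0 * n_params

def best_by_aic(labels, models):
    """models maps a name to (fitted probabilities, number of parameters)."""
    scores = {name: aic(bernoulli_log_likelihood(labels, probs), k)
              for name, (probs, k) in models.items()}
    return min(scores, key=scores.get), scores

# Toy values, invented for illustration only.
labels = [1, 1, 0, 0, 1, 0]
models = {
    "intercept_only": ([0.5] * 6, 1),
    "one_feature": ([0.9, 0.8, 0.2, 0.1, 0.7, 0.3], 2),
    "five_features": ([0.9, 0.8, 0.2, 0.1, 0.75, 0.3], 6),
}
best, scores = best_by_aic(labels, models)
print(best, scores)
```

A step-wise search would refit the logistic regression after adding or dropping one term at a time and keep the change that lowers the AIC most; the fitting step is omitted here.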

13
Pattern recognition uses the features from
pre-processing to classify the data
  • Extract Features
  • Training data
  • Create Classifier: Decision Tree, GLM
  • Check for Accuracy
  • Extract Features for Unclassified Data
  • Apply Classifier to Unclassified Data
  • Show Results and Obtain Score
  • Update Training Data

An iterative and interactive classification
process
14
We use decision trees to classify the radio
sources into bents and non-bents
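  • Use information gain to split: $\mathrm{Gain} = H(S) - \sum_{j=1}^{2} \frac{|S_j|}{|S|} H(S_j)$,
    where $H(S) = -\sum_{i=1}^{k} \frac{n_i(S)}{|S|} \log_2 \frac{n_i(S)}{|S|}$
  • $S$: set of examples at a node
  • $k$: number of classes
  • $S_1, S_2$: $S$ split into two
  • $n_i(S)$: number of examples of class $i$ in $S$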
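
A minimal sketch of an information-gain split on one numeric feature, using the standard entropy-based definition. The function names and the four-example toy data are illustrative; the actual Sapphire tree code is not shown in the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a set of examples."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of its two children."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; return (threshold, gain)."""
    pairs = sorted(zip(values, labels))
    best = (None, 0.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        threshold = (v1 + v2) / 2.0
        left = [y for v, y in pairs if v <= threshold]
        right = [y for v, y in pairs if v > threshold]
        gain = information_gain(labels, left, right)
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Toy feature values and labels, for illustration only.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["bent", "bent", "non-bent", "non-bent"]
threshold, gain = best_threshold(values, labels)
print(threshold, gain)
```

A tree builder would repeat this search over every feature at a node, split on the best one, and recurse on the two children.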

15
Decision tree created with all the features: Tree 1
  • Resubstitution error, train/test (90%) set: 2.8%
  • Cross-validation error, train/validate (10%) set:
    5.3%

16
Decision tree created with the EDA and PCA
features: Tree 2
  • Resubstitution error: 1.7%
  • Cross-validation error: 5.3%

17
Decision tree created with the GLM 3 features:
Tree 3
  • Resubstitution error: 2.8%
  • Cross-validation error: 0%
  • Using fewer, well-selected variables results in
    smaller and more accurate trees

18
We also used generalized linear models (GLMs) to
classify the galaxies
  • Linear models explain response variables in terms
    of linear combinations of explanatory variables: $E(Y) = X\beta$
  • Least-squares estimate $\hat\beta$ solves $X^T X \hat\beta = X^T Y$
  • No restrictions on the range of fitted values
  • GLMs allow such restrictions by modeling $g(E(Y)) = X\beta$,
  • where g() is a monotone increasing link
    function

19
Logistic regression is a special GLM suitable for
modeling binary responses
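  • $Y \in \{0, 1\}$
  • Logit link and variance functions: $g(\mu) = \log\frac{\mu}{1 - \mu}$, $V(\mu) = \mu(1 - \mu)$
  • Likelihood non-linear in parameters, no
    closed-form solution: iteratively reweighted
    least squares to find $\hat\beta$
  • Given $\hat\beta$, predict $\hat{Y} = I(\hat\mu > \text{fraction})$,
  • where $I(a)$ is 0, 1 according to $a$ = False,
    $a$ = True, and the fraction is generally taken
    to be 0.5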
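
The fitting and thresholding steps can be sketched as below, assuming NumPy. This is a textbook IRLS loop for the logit link run on synthetic data, not the code used by the authors. The function names, the convergence tolerance, and the simulated coefficients (-0.5 and 2.0) are assumptions of the sketch.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones to a 1-D or 2-D array of explanatory variables."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(X.shape[0]), X])

def fit_logistic_irls(X, y, max_iter=50, tol=1e-10):
    """Fit logistic regression by iteratively reweighted least squares."""
    A = add_intercept(X)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(A.shape[1])
    for _ in range(max_iter):
        eta = A @ beta                                # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))               # inverse logit link
        w = np.clip(mu * (1.0 - mu), 1e-10, None)     # variance function as weights
        z = eta + (y - mu) / w                        # working response
        AtW = A.T * w                                 # A^T W without forming diag(w)
        new_beta = np.linalg.solve(AtW @ A, AtW @ z)  # weighted least-squares step
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

def predict_labels(beta, X, cutoff=0.5):
    """Label an observation 1 when its fitted probability exceeds the cutoff."""
    mu = 1.0 / (1.0 + np.exp(-(add_intercept(X) @ beta)))
    return (mu > cutoff).astype(int)

# Synthetic data with one explanatory variable, for illustration only.
rng = np.random.default_rng(1)
x = rng.normal(size=400)
true_prob = 1.0 / (1.0 + np.exp(-(-0.5 + 2.0 * x)))
y = (rng.random(400) < true_prob).astype(int)

beta_hat = fit_logistic_irls(x, y)
labels = predict_labels(beta_hat, x)
print(beta_hat, np.mean(labels != y))
```

The loop assumes the classes overlap; on perfectly separable data the estimates diverge and a safeguard such as a penalty term would be needed.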

20
GLM created with the GLM 2 features
21
GLM created with the GLM 3 features
22
GLM created with the GLM 4 features
23
Misclassification errors of best models are below
the desired 10% in training set
  • Careful selection of variables reduces error
  • Trees are less sensitive to input features than
    GLMs
  • GLM 4 has lowest misclassification errors

Misclassification errors based on 10 ten-fold
cross-validations in the training set
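
A sketch of how an error estimate from 10 ten-fold cross-validations could be computed, using only the standard library. The classifier is a deliberately simple stand-in (a cut halfway between the two class means) rather than the decision trees or GLMs from the slides, and the data are synthetic.

```python
import random

def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def repeated_cv_error(xs, ys, fit, n_repeats=10, k=10, seed=0):
    """Average misclassification rate over n_repeats independent k-fold splits."""
    rng = random.Random(seed)
    per_repeat = []
    for _ in range(n_repeats):
        wrong = 0
        for fold in k_fold_indices(len(xs), k, rng):
            held_out = set(fold)
            train_x = [x for i, x in enumerate(xs) if i not in held_out]
            train_y = [y for i, y in enumerate(ys) if i not in held_out]
            model = fit(train_x, train_y)              # fit on the other k-1 folds
            wrong += sum(model(xs[i]) != ys[i] for i in fold)
        per_repeat.append(wrong / len(xs))
    return sum(per_repeat) / n_repeats

def fit_midpoint_rule(train_x, train_y):
    """Stand-in classifier: cut halfway between the two class means."""
    ones = [x for x, y in zip(train_x, train_y) if y == 1]
    zeros = [x for x, y in zip(train_x, train_y) if y == 0]
    mean1, mean0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
    cut = (mean0 + mean1) / 2.0
    return lambda x: int((x > cut) == (mean1 > mean0))

# Synthetic one-feature data, for illustration only.
xs = [i / 10 for i in range(40)]
ys = [0] * 20 + [1] * 20
cv_error = repeated_cv_error(xs, ys, fit_midpoint_rule)
print(cv_error)
```

Repeating the ten-fold split with different shuffles reduces the dependence of the estimate on any single partition of the training set.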
24
Our methods identified the interesting part of
the FIRST dataset
  • 15,059 three-entry radio sources in the 2000
    catalog
  • 2,577 labeled as bent by all six methods
  • Astronomers can start by exploring the smaller
    set
  • Visually explore random samples to assess the
    percentage of false positives and missed bents

Classification results for the entire 2000 catalog
25
Example classifications for previously unlabeled
galaxies are encouraging
  • The labels commonly assigned by the six methods
    are correct in the examples below

Bent
Non-bent
26
Summary
  • Described how data mining can help identify radio
    galaxies with bent-double morphology
  • Illustrated specific data mining steps
      • data pre-processing is crucial
  • In our experience, data mining is semi-automatic
      • interaction and feedback required at many stages
      • domain knowledge is essential
  • Multi-disciplinary collaboration is challenging,
    but rewarding
      • astronomy - computer science - statistics
  • There is always room for improvement
      • alternative techniques
      • your feedback welcome!

27
The Sapphire team: supporting a
multi-disciplinary endeavor
  • Chandrika Kamath (Project Lead)
  • Erick Cantú-Paz
  • Imola K. Fodor
  • Nu A. Tang
  • Thanks to the FIRST scientists: Robert Becker,
    Michael Gregg, David Helfand, Sally
    Laurent-Muehleisen, and Rick White

www.llnl.gov/casc/sapphire
UCRL-JC-145672. This work was performed under the
auspices of the U.S. Department of Energy
by University of California Lawrence Livermore
National Laboratory under contract W-7405-Eng-48.