1
Neural (and non-neural) tools for data mining in
massive data sets
Giuseppe Longo
Department of Physical Sciences, University Federico II in Napoli
I.N.F.N. Napoli Unit
I.N.A.F. Napoli Unit
longo@na.infn.it
Thanks to S.G. Djorgovski, from whom I borrowed several slides
VO-Tech Meeting November 2004, Cambridge
2
Massive data sets in astronomy (?)
Pixel space → D.M.
Catalogues → D.M.
3
Data Mining in the Image Domain (images/spectra,
but also in the time/light-curve domain and in
simulations)
SOCIOLOGICAL ISSUE
The r.m.s. astronomer's statement: "Every object
detection algorithm has its biases and
limitations, and misses specific objects or fails
at specific tasks. Therefore I don't trust any
algorithm, and MY PIXELS ARE MINE!"
To discover new phenomena or parametrize older
ones, the KEYWORDS ARE: Segmentation, Automated
Pattern Recognition
  • Effective parametrization of source morphologies
    and environments
  • Multiscale analysis
  • Spatially and/or photometrically correlated
    objects
  • Drop-outs / upper limits?
  • Etc.

4
High dimensionality parameter spaces
Along each axis the measurements are
characterized by their position, extent, sampling,
and resolution. All astronomical measurements
span some volume in this parameter space.
5
Catalog Domain (Source Attributes)
  • Clustering Analysis (supervised and
    unsupervised)
  • How many different types of objects are there?
  • Are there any rare or new types, outliers?
  • Multivariate Correlation Search
  • Are there significant, nontrivial correlations
    present in the data?

Clusters vs. Correlations
Astrophysicists strive for correlations:
correlations help reduce the statistical
dimensionality
6
Fact: in VO data sets D_D >> 1, D_S >> 1
Advantages: Data Complexity → Multidimensionality → Discoveries
But: the computational cost of clustering analysis is
K-means: K × N × I × D
Expectation Maximisation: K × N × I × D^2
Monte Carlo Cross-Validation: M × K_max^2 × N × I × D^2
where N = no. of data vectors, D = no. of data
dimensions, K = no. of clusters chosen, K_max =
max no. of clusters tried, I = no. of iterations,
M = no. of Monte Carlo trials/partitions
→ Terascale (Petascale?) computing and/or better
algorithms
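These scalings can be made concrete with a toy operation-count estimator (a pure-Python sketch; the example sizes below are illustrative assumptions, not measured VO workloads):

```python
# Rough operation counts for the clustering costs quoted above.
# N: data vectors, D: dimensions, K: clusters, I: iterations,
# M: Monte Carlo trials, Kmax: max no. of clusters tried.

def kmeans_cost(K, N, I, D):
    return K * N * I * D

def em_cost(K, N, I, D):
    return K * N * I * D ** 2

def mc_cv_cost(M, Kmax, N, I, D):
    return M * Kmax ** 2 * N * I * D ** 2

# Example: a modest VO-scale catalogue (assumed sizes).
N, D, K, I = 10 ** 6, 100, 10, 50
print(kmeans_cost(K, N, I, D))       # 5e10 operations
print(em_cost(K, N, I, D))           # 5e12 operations
print(mc_cv_cost(10, 20, N, I, D))   # 2e15 operations
```

The D² factor in EM and the K_max² factor in cross-validation are what push realistic catalogues toward terascale computing.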
7
The Curse of Hyperdimensionality
WHAT DO WE DO WHEN N > 50?
8
To visualize and understand, we somehow need to
compress the relevant information
  • Interactive visualization is a key part of the
    data mining process
  • Some methodology exists, but much more is needed

9
The Astroneural Collaboration
P.I.: Giuseppe Longo (DSF)
Gennaro Miele (DSF)
Roberto Tagliaferri (DMI)
Staff:
Roberto Amato (DSF, student)
Angelo Ciaramella (DMI, post-doc)
Carmine Del Mondo (DSF, fellow)
Lara de Vinco (DMI, fellow)
Ciro Donalek (DSF, Ph.D.)
Omar Laurino (DSF, student)
Gianpiero Mangano (INFN, senior)
Giancarlo Raiconi (DMI, senior)
Antonio Staiano (DMI, post-doc)
10
Aims and applications of AstroNeural
A user-friendly tool to perform clustering and data
mining in high-dimensionality spaces
Aims:
  • Clustering / pattern recognition in high-dimensionality spaces
  • Visualization
  • Classification
  • Parametrization of images
  • Modeling of massive data sets
Applications:
  • Astrophysics
  • Genetics
  • Geophysics
  • High energy physics
  • Atmospheric physics
  • Etc.

11
  • Neural Networks are good at:
  • performing linear and non-linear interpolation
  • generalizing
  • performing non-linear analysis and identifying common trends in data
  • forecasting
  • classifying
  • A priori knowledge needed:
  • Training
  • Computation of errors
  • They learn in two main ways:
  • Supervised
  • Unsupervised

Unsupervised case: null/small a priori knowledge;
clustering is done on the statistical properties of
the data themselves; performances and errors are
derived statistically; knowledge comes through labeling
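The two learning modes can be contrasted with a minimal toy sketch (pure Python, invented data; this is not part of the Astroneural code): a perceptron trained on labeled examples versus a 1-D k-means that groups unlabeled values using only their own statistics.

```python
# Supervised: a perceptron learns a decision rule from labeled examples.
def train_perceptron(data, labels, epochs=20, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, t in zip(data, labels):
            y = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = t - y  # supervision: compare output to the known label
            w = [w[0] + lr * err * x[0], w[1] + lr * err * x[1]]
            b += lr * err
    return w, b

# Unsupervised: 1-D k-means groups unlabeled data by its own statistics.
def kmeans_1d(xs, centers, iters=10):
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for x in xs:
            nearest = min(centers, key=lambda c: abs(x - c))
            groups[nearest].append(x)
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

print(kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], [0.0, 5.0]))
```

The perceptron needs labels and yields a classifier; k-means needs none and yields clusters whose meaning must be assigned afterwards through labeling, exactly the distinction drawn above.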
12
There are hundreds of different NNs:
  • MLP (multi-layer perceptron): slow, supervised,
    non-linear
  • SOM (self-organizing maps): faster,
    unsupervised, non-linear, great visualization,
    non-physical output
  • GTM (generative topographic mapping): slow,
    unsupervised, great visualization, physical
    output
  • PCA / ICA (linear and non-linear): terrible
    visualization, physical output, good performance
    on uncorrelated data
  • Fuzzy C-Means: slow on MDSs, effective in fuzzy
    problems
  • PPS: great (the best for unsupervised
    clustering, classification and visualization)
  • Competitive Evolution on Data (CED): bad
    visualization, great accuracy as an unsupervised
    clustering tool
  • Etc.

13
Astroneural V1.01
Prototype in C / Matlab (partial DEMO on this laptop)
Supervised / unsupervised:
  • unsupervised → parameter options
  • supervised → parameter and training options
Labeled / unlabeled:
  • labeled → label preparation
Training set preparation
Feature selection via unsupervised clustering
Fuzzy set, etc.
Methods: GTM, SOM, MLP, RBF, PPS, etc.
INTERPRETATION
14
(No Transcript)
15
3-D U Matrix
Similarity coloring
Feature significance maps
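As a reminder of what the U-Matrix encodes: each cell holds the average distance between a neuron's weight vector and those of its grid neighbours, so high values mark cluster borders and low values cluster interiors. A minimal sketch (pure Python; a rectangular 4-neighbour grid is assumed here, while the maps shown use hexagonal neighbourhoods):

```python
def u_matrix(weights):
    """weights: 2-D grid (list of lists) of weight vectors.
    Returns a grid of the same shape where each cell is the mean
    Euclidean distance to its 4-connected neighbours."""
    rows, cols = len(weights), len(weights[0])

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            nbrs = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            ds = [dist(weights[i][j], weights[a][b])
                  for a, b in nbrs if 0 <= a < rows and 0 <= b < cols]
            row.append(sum(ds) / len(ds))
        out.append(row)
    return out

# Two flat regions with a sharp border between columns 1 and 2:
grid = [[[0.0], [0.0], [1.0], [1.0]],
        [[0.0], [0.0], [1.0], [1.0]]]
um = u_matrix(grid)
```

On this toy grid the interior cells get value 0 and the two border columns get 1/3, which is the ridge a 3-D U-Matrix renders as a wall between clusters.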
16
Example n. 1: photometric redshifts for SDSS
SDSS-EDR DB
Unsupervised SOMs + supervised MLPs
SOM (unsup.): completeness / reliability map
SOM (unsup.): set construction
MLP (supervised): experiments
SOM (supervised): feature selection
Best MLP model
  • Input data set: SDSS EDR photometric data
    (galaxies)
  • Training/validation/test set: SDSS-EDR
    spectroscopic subsample
17
Unsupervised SOMs + supervised MLPs
Robust error 0.0217616 millions P.R.
SOM output (each hexagon is a neuron). Numbers
above the frame: redshift range. Numbers in the
cells: number of input data activating that neuron.
18
Example n. 2: monitoring of complex instruments
  • TNG telemetry monitors 278 parameters at a high
    sampling rate.
  • SOM clustering on 278 parameters × 35,000 epochs
    (no information on parameter correlations, etc.:
    78 actuators, 6 M2 bars, 3 M3 parameters,
    temperatures, encoders, etc.)
  • AIM: is it possible to monitor data quality using
    telemetry?

19
B.E.S.
U-Matrix (278 parameters) → compressed feature space (BMU matrices)
Up: good tracking. Below: bad tracking.
20
Probabilistic Principal Surfaces (PPS)
Based on latent variables: high-dimensionality
data are projected onto lower-dimensionality (3-D)
manifolds (usually spheres, as in GTM)
21
Probabilistic Principal Surfaces (PPS)
Why oriented covariance?
Under a spherical Gaussian model such as the GTM,
points 1 and 2 have equal influence on the center
node y(x) (a). PPS have an oriented covariance
matrix, so point 1 is probabilistically closer to
the center node y(x) than point 2 (b).
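The oriented covariance can be written down explicitly. The following is a sketch of the unified PPS covariance in the form given by Chang and Ghosh (quoted from memory, so treat the exact coefficients as an assumption): e_d(x) are orthonormal vectors, the first Q spanning the tangent space of the manifold at y(x); D is the data dimension and Q the latent dimension.

```latex
\Sigma(x) = \frac{\alpha}{\beta}\sum_{d=1}^{Q} e_d(x)\, e_d(x)^{\mathsf T}
          + \frac{D-\alpha Q}{\beta\,(D-Q)}\sum_{d=Q+1}^{D} e_d(x)\, e_d(x)^{\mathsf T},
\qquad 0 < \alpha < \frac{D}{Q}
```

Setting α = 1 recovers the spherical Gaussian of the GTM; α ≠ 1 stretches or shrinks the covariance along the manifold while keeping the total variance D/β fixed, which is what makes points 1 and 2 in the figure probabilistically inequivalent.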
22
  • Visualization with PPS:
  • (a) The spherical manifold in R^3 latent space.
  • (b) The spherical manifold in R^3 data space.
  • (c) Projection of data point t onto the latent
    spherical manifold.

23
Probabilistic Principal Surfaces (PPS)
GOODS Data set
24
Astroneural Version 1.01: ready
Porting to freeware software: in progress
Web interface: ready
Implementation of specific tasks for GRID use: in
progress (NA, TS, CT, etc.)
Implementation of the backwards connection to
pixels: in progress
25
Flexibility of use?
Aim: to create a catalogue of genes whose
transcription level changes periodically during the
cellular cycle (Spellman et al., Molecular Biology
of the Cell, 9, 3273, 1998). 6178 genes, 4
experiments (alpha factor arrest, cdc15
temperature-sensitive mutant, cdc28, elutriation),
uneven sampling, high level of noise, poor a
priori understanding.
Saccharomyces cerevisiae genes
Feature selection via ICA NNs: 32 features
fed to PPS and CED
Preprocessing: genes reduced to 5425, with 32
features each
26
30 clusters (instead of 8)
27
Good! New knowledge emerges; better understanding
of old knowledge.
But Competitive Evolution on Data (CED) is far
better!
28
(No Transcript)
29
High Energy Data: Pierre AUGER FD data,
composition of primary particles, supervised MLP.
Simulated CORSIKA showers, E = 10^17 eV, i0
30
Summary and Conclusions
  • NN tools are intrinsically parallel (1 neuron =
    1 CPU)
  • NNs are flexible, with high generalization and
    visualization capabilities, but they are only one
    side of the coin
  • They work equally well on catalogue and pixel
    data
  • Fine tuning on specific problems is always needed
  • Foreseeable steps:
  • Integration with Astro-MD
  • VO-table compliant I/O preprocessing, or link to
    TOMCAT
  • WEB services implementation for some of the DM
    tools
  • Development and implementation of tools
    according to needs