1
Neural (and non-neural) tools for data mining in
massive data sets
Giuseppe Longo
Department of Physical Sciences, University Federico II in Napoli
I.N.F.N. Napoli Unit
I.N.A.F. Napoli Unit
longo@na.infn.it
Thanks to S.G. Djorgovski, from whom I borrowed several slides
VO-Tech Meeting November 2004, Cambridge
2
Massive data sets in astronomy (?)
Pixel space → D.M.
Catalogues → D.M.
3
Data Mining in the Image Domain (images/spectra,
but also in the time/light-curve domain and in
simulations)
SOCIOLOGICAL ISSUE
The r.m.s. astronomer's statement: "Every object
detection algorithm has its biases and
limitations, and misses specific objects or fails
at specific tasks. Therefore I don't trust any
algorithm, and MY PIXELS ARE MINE!"
To discover new phenomena or parametrize older
ones, the KEYWORDS ARE: Segmentation, Automated
Pattern Recognition
  • Effective parametrization of source morphologies
    and environments
  • Multiscale analysis
  • Spatially and/or photometrically correlated
    objects
  • Drop-outs / upper limits?
  • Etc.

4
High dimensionality parameter spaces
Along each axis the measurements are
characterized by their position, extent, sampling,
and resolution. All astronomical measurements
span some volume in this parameter space.
5
Catalog Domain (Source Attributes)
  • Clustering Analysis (supervised and
    unsupervised)
  • How many different types of objects are there?
  • Are there any rare or new types, outliers?
  • Multivariate Correlation Search
  • Are there significant, nontrivial correlations
    present in the data?

Clusters vs. Correlations
Astrophysicists strive for correlations:
correlations help reduce the statistical
dimensionality
6
Fact: in VO data sets D_D >> 1, D_S >> 1
Advantages: Data Complexity → Multidimensionality → Discoveries
But: the computational cost of clustering analysis is
K-means: K × N × I × D
Expectation Maximisation: K × N × I × D^2
Monte Carlo Cross-Validation: M × K_max^2 × N × I × D^2
where N = no. of data vectors, D = no. of data
dimensions, K = no. of clusters chosen, K_max =
max no. of clusters tried, I = no. of iterations,
M = no. of Monte Carlo trials/partitions
→ Terascale (Petascale?) computing and/or better
algorithms
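These scalings can be made concrete with a toy operation-count estimator (a pure-Python sketch; the example sizes below are illustrative assumptions, not measured VO workloads):

```python
# Rough operation counts for the clustering costs quoted above.
# N: data vectors, D: dimensions, K: clusters, I: iterations,
# M: Monte Carlo trials, Kmax: max no. of clusters tried.

def kmeans_cost(K, N, I, D):
    return K * N * I * D

def em_cost(K, N, I, D):
    return K * N * I * D ** 2

def mc_cv_cost(M, Kmax, N, I, D):
    return M * Kmax ** 2 * N * I * D ** 2

# Example: a modest VO-scale catalogue (assumed sizes).
N, D, K, I = 10 ** 6, 100, 10, 50
print(kmeans_cost(K, N, I, D))       # 5e10 operations
print(em_cost(K, N, I, D))           # 5e12 operations
print(mc_cv_cost(10, 20, N, I, D))   # 2e15 operations
```

The D² factor in EM and the K_max² factor in cross-validation are what push realistic catalogues toward terascale computing.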
7
The Curse of Hyperdimensionality
WHAT DO WE DO WHEN N > 50?
8
To visualize and understand, we somehow need to
compress the relevant information
  • Interactive visualization is a key part of the
    data mining process
  • Some methodology exists, but much more is needed

9
The Astroneural Collaboration
P.I.: Giuseppe Longo (DSF)
Gennaro Miele (DSF)
Roberto Tagliaferri (DMI)
Staff:
Roberto Amato (DSF, student)
Angelo Ciaramella (DMI, post-doc)
Carmine Del Mondo (DSF, fellow)
Lara de Vinco (DMI, fellow)
Ciro Donalek (DSF, Ph.D.)
Omar Laurino (DSF, student)
Gianpiero Mangano (INFN, senior)
Giancarlo Raiconi (DMI, senior)
Antonio Staiano (DMI, post-doc)
10
Aims and applications of AstroNeural
A user-friendly tool to perform clustering and data
mining in high-dimensionality spaces
Aims:
  • Clustering / pattern recognition in high-dimensionality spaces
  • Visualization
  • Classification
  • Parametrization of images
  • Modeling of massive data sets
Applications:
  • Astrophysics
  • Genetics
  • Geophysics
  • High energy physics
  • Atmospheric physics
  • Etc.

11
  • Neural Networks are good at:
  • performing linear and non-linear interpolation
  • generalizing
  • performing non-linear analysis and identifying common trends in data
  • forecasting
  • classifying
  • A priori knowledge needed:
  • Training
  • Computation of errors
  • They learn in two main ways:
  • Supervised
  • Unsupervised

Unsupervised case: null/small a priori knowledge;
clustering is done on the statistical properties of
the data themselves; performances and errors are
derived statistically; knowledge comes through labeling
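The two learning modes can be contrasted with a minimal toy sketch (pure Python, invented data; this is not part of the Astroneural code): a perceptron trained on labeled examples versus a 1-D k-means that groups unlabeled values using only their own statistics.

```python
# Supervised: a perceptron learns a decision rule from labeled examples.
def train_perceptron(data, labels, epochs=20, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, t in zip(data, labels):
            y = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = t - y  # supervision: compare output to the known label
            w = [w[0] + lr * err * x[0], w[1] + lr * err * x[1]]
            b += lr * err
    return w, b

# Unsupervised: 1-D k-means groups unlabeled data by its own statistics.
def kmeans_1d(xs, centers, iters=10):
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for x in xs:
            nearest = min(centers, key=lambda c: abs(x - c))
            groups[nearest].append(x)
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

print(kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], [0.0, 5.0]))
```

The perceptron needs labels and yields a classifier; k-means needs none and yields clusters whose meaning must be assigned afterwards through labeling, exactly the distinction drawn above.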
12
There are hundreds of different NNs:
  • MLP (multi-layer perceptron): slow, supervised,
    non-linear
  • SOM (self-organizing maps): faster,
    unsupervised, non-linear, great visualization,
    non-physical output
  • GTM (generative topographic mapping): slow,
    unsupervised, great visualization, physical
    output
  • PCA / ICA (linear and non-linear): terrible
    visualization, physical output, good performance
    on uncorrelated data
  • Fuzzy C-Means: slow on MDSs, effective in fuzzy
    problems
  • PPS: great (the best for unsupervised
    clustering, classification and visualization)
  • Competitive Evolution on Data (CED): bad
    visualization, great accuracy as an unsupervised
    clustering tool
  • Etc.

13
Astroneural V1.01
Prototype in C / Matlab (partial DEMO on this laptop)
Supervised / unsupervised:
  • unsupervised → parameter options
  • supervised → parameter and training options
Labeled / unlabeled:
  • labeled → label preparation
Training set preparation
Feature selection via unsupervised clustering
Fuzzy set, etc.
Methods: GTM, SOM, MLP, RBF, PPS, etc.
INTERPRETATION
14
(No Transcript)
15
3-D U Matrix
Similarity coloring
Feature significance maps
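As a reminder of what the U-Matrix encodes: each cell holds the average distance between a neuron's weight vector and those of its grid neighbours, so high values mark cluster borders and low values cluster interiors. A minimal sketch (pure Python; a rectangular 4-neighbour grid is assumed here, while the maps shown use hexagonal neighbourhoods):

```python
def u_matrix(weights):
    """weights: 2-D grid (list of lists) of weight vectors.
    Returns a grid of the same shape where each cell is the mean
    Euclidean distance to its 4-connected neighbours."""
    rows, cols = len(weights), len(weights[0])

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            nbrs = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            ds = [dist(weights[i][j], weights[a][b])
                  for a, b in nbrs if 0 <= a < rows and 0 <= b < cols]
            row.append(sum(ds) / len(ds))
        out.append(row)
    return out

# Two flat regions with a sharp border between columns 1 and 2:
grid = [[[0.0], [0.0], [1.0], [1.0]],
        [[0.0], [0.0], [1.0], [1.0]]]
um = u_matrix(grid)
```

On this toy grid the interior cells get value 0 and the two border columns get 1/3, which is the ridge a 3-D U-Matrix renders as a wall between clusters.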
16
Example n. 1: photometric redshifts for SDSS
SDSS-EDR DB
Unsupervised SOMs + supervised MLPs
SOM (unsup.): completeness / reliability map
SOM (unsup.): set construction
MLP (supervised): experiments
SOM (supervised): feature selection
Best MLP model
  • Input data set: SDSS EDR photometric data
    (galaxies)
  • Training/validation/test set: SDSS-EDR
    spectroscopic subsample
17
Unsupervised SOMs + supervised MLPs
Robust error 0.0217616 millions P.R.
SOM output (each hexagon is a neuron). Numbers
above the frame: redshift range. Numbers in the
cells: number of input data activating that neuron.
18
Example n. 2: monitoring of complex instruments
  • TNG telemetry monitors 278 parameters at a high
    sampling rate.
  • SOM clustering on 278 parameters × 35,000 epochs
    (no information on parameter correlations, etc.:
    78 actuators, 6 M2 bars, 3 M3 parameters,
    temperatures, encoders, etc.)
  • AIM: is it possible to monitor data quality using
    telemetry?

19
B.E.S.
U-Matrix (278 parameters) → compressed feature space (BMU matrices)
Up: good tracking. Below: bad tracking.
20
Probabilistic Principal Surfaces (PPS)
Based on latent variables: high-dimensionality
data are projected onto lower-dimensionality (3-D)
manifolds (usually spheres, as in GTM)
21
Probabilistic Principal Surfaces (PPS)
Why oriented covariance?
Under a spherical Gaussian model such as the GTM,
points 1 and 2 have equal influence on the center
node y(x) (a). PPS have an oriented covariance
matrix, so point 1 is probabilistically closer to
the center node y(x) than point 2 (b).
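The oriented covariance can be written down explicitly. The following is a sketch of the unified PPS covariance in the form given by Chang and Ghosh (quoted from memory, so treat the exact coefficients as an assumption): e_d(x) are orthonormal vectors, the first Q spanning the tangent space of the manifold at y(x); D is the data dimension and Q the latent dimension.

```latex
\Sigma(x) = \frac{\alpha}{\beta}\sum_{d=1}^{Q} e_d(x)\, e_d(x)^{\mathsf T}
          + \frac{D-\alpha Q}{\beta\,(D-Q)}\sum_{d=Q+1}^{D} e_d(x)\, e_d(x)^{\mathsf T},
\qquad 0 < \alpha < \frac{D}{Q}
```

Setting α = 1 recovers the spherical Gaussian of the GTM; α ≠ 1 stretches or shrinks the covariance along the manifold while keeping the total variance D/β fixed, which is what makes points 1 and 2 in the figure probabilistically inequivalent.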
22
  • Visualization with PPS:
  • (a) The spherical manifold in R^3 latent space.
  • (b) The spherical manifold in R^3 data space.
  • (c) Projection of data point t onto the latent
    spherical manifold.

23
Probabilistic Principal Surfaces (PPS)
GOODS Data set
24
Astroneural Version 1.01: ready
Porting to freeware software: in progress
Web interface: ready
Implementation of specific tasks for GRID use: in
progress (NA, TS, CT, etc.)
Implementation of the backwards connection to
pixels: in progress
25
Flexibility of use?
Aim: to create a catalogue of genes whose
transcription level changes periodically during the
cellular cycle (Spellman et al., Molecular Biology
of the Cell, 9, 3273, 1998). 6178 genes, 4
experiments (alpha factor arrest, cdc15
temperature-sensitive mutant, cdc28, elutriation),
uneven sampling, high level of noise, poor a
priori understanding.
Saccharomyces cerevisiae genes
Feature selection via ICA NNs: 32 features
fed to PPS and CED
Preprocessing: genes reduced to 5425, with 32
features each
26
30 clusters (instead of 8)
27
Good! New knowledge emerges; better understanding
of old knowledge.
But Competitive Evolution on Data (CED) is far
better!
28
(No Transcript)
29
High Energy Data: Pierre AUGER FD data,
composition of primary particles, supervised MLP.
Simulated CORSIKA showers, E = 10^17 eV, i0
30
Summary and Conclusions
  • NN tools are intrinsically parallel (1 neuron =
    1 CPU)
  • NNs are flexible, with high generalization and
    visualization capabilities, but they are only one
    side of the coin
  • They work equally well on catalogue and pixel
    data
  • Fine tuning on specific problems is always needed
  • Foreseeable steps:
  • Integration with Astro-MD
  • VO-table compliant I/O preprocessing, or link to
    TOMCAT
  • WEB services implementation for some of the DM
    tools
  • Development and implementation of tools
    according to needs