Chapter 7 - Preparing Scientific and Engineering
Data for Mining
  • Chandrika Kamath
  • Center for Applied Scientific Computing
  • Lawrence Livermore National Laboratory
  • http//

UCRL-PRES-145087 The work of Chandrika Kamath
in Chapters 5, 6, and 7 was performed under the
auspices of the U.S. Department of Energy by the
University of California Lawrence Livermore
National Laboratory under contract No.
The input data cannot be directly input to the
pattern recognition algorithms
  • Input data: images/meshes, time-dependent,
    multi-sensor, compressed, spatio-temporal,
    massive, in 2, 3, or 4 dimensions
  • Pattern recognition algorithms: classification,
    clustering, regression, ...
  • The algorithms operate on data items

The input data must be processed to make it
suitable for the pattern recognition algorithms.
Science and engineering data are available in
different formats
  • Different storage formats
  • FITS, AIPS in astronomy
  • netCDF, GRIB (grid in binary) in climate
  • Different ways of generating output
  • sea surface temps for each month in a file
  • sea surface temps for each year in a file
  • Depending on the problem, data can be
  • one-dimensional, usually time series, from
    sensors or processing of other data
  • two-dimensional (spatial) + time
  • three-dimensional (spatial) + time

Two-dimensional scientific data is available as
images or as meshes
  • Can have spatial and temporal aspects
  • Images
  • pixel values: gray-scale or real
  • a scene obtained using different sensors, at
    different times, at different resolutions
  • can be noisy, with noise varying between images
    and within an image
  • Mesh
  • values at a mesh point are real
  • cell centered, node centered, or edge centered

Three dimensional scientific data comes from
modeling objects in 3-D
  • Values at a mesh point are real
  • Can be cell centered, node centered, edge
    centered, or face centered
  • Often have a series of meshes in time: spatial
    and temporal aspects

The complexity of meshes makes it difficult to
extract features
[Figure panels: Cartesian, structured, unstructured,
composed unstructured mesh, hierarchy of regular meshes]
The distribution of mesh points can change with
time - need feature tracking
Composite meshes - locally structured, globally
Science data is often not in a form ready for
pattern recognition
  • Data available as pixels or values at mesh points
  • But the patterns of interest are at a higher level

The raw data must be transformed into features
before we can apply pattern recognition. Extracting
features that are robust, relevant to the
problem, and invariant to scaling, rotation, and
translation is non-trivial and time consuming,
but essential to the success of the pattern
recognition algorithm.
Most of the work in data mining focuses on
pattern recognition, BUT ...
  • it is the data pre-processing which is
  • more influential and time consuming
  • domain specific and therefore less general
  • perhaps as little as 10% effort was spent on
    classification aspects of the problem. (Burl 98)
  • Langley/Simon 95: much of the power comes not
    from the specific induction method, but from
    proper formulation of the problems and from
    crafting the representation to make learning
    tractable.
  • Brodley/Smyth 95: in practical applications,
    it is often the data and human issues which
    ultimately dictate success or failure of a
    project rather than algorithmic and model issues.

The Sapphire view of the end-to-end data mining
process (Kamath01)
  • Data stages: Raw Data, Target Data, Preprocessed
    Data, Transformed Data
  • Data Preprocessing: data fusion, sampling,
    multi-resolution analysis, de-noising, object
    identification, feature extraction,
    normalization, dimension reduction
  • Pattern Recognition: classification, clustering,
    regression, ...
  • Interpreting Results: visualization, validation
Let's make a few simple assumptions in our
discussion of data preparation ...
  • We understand the problem and the data
  • We have formulated a solution approach
  • We have relatively easy access to the data
  • We have the software to read, write, and display
    the data
  • We have the software to bring the data into a
    consistent format

Satisfying these simple assumptions may
require far more time than you expect!
Data fusion may be necessary when data from many
sources is available
  • Combining information from more than one source
    to make a more accurate and better informed
  • Exploit complementary information from different
    sensors, at different wavelengths, from different

Images of the Crab Nebula from
Data registration is an important part of data
  • Registration: align images to relate information
    in one image to information in another image
  • Used in data fusion and change detection

[Figure panels: translation, rigid body, rotation,
horizontal shear]
Obtain a global or local transformation to
match the input data to the reference data.
There are four major components of data
registration (Brown 92)
  • Feature space: what features do we use for
  • pixels, edges, contours, corners, ...
  • Search space: what transformation to use to
    establish correspondence between input and
    reference data?
  • translations, rotations, scaling, ...
  • Search strategy: which transformations are
    computed and evaluated?
  • exhaustive search, multiresolution methods
  • Similarity metric: how to evaluate the match
    between input and reference data?
  • mean square error, sum of absolute differences
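
The four components can be illustrated with a deliberately small sketch. It assumes pixels as the feature space, integer translations as the search space, exhaustive search as the strategy, and mean square error as the similarity metric. The function names, the `max_shift` parameter, and the periodic wrap-around are illustrative assumptions, not part of the slides.

```python
import numpy as np

def mean_square_error(a, b):
    """Similarity metric: mean square error between two equally sized arrays."""
    return float(np.mean((a.astype(float) - b.astype(float)) ** 2))

def register_translation(reference, moving, max_shift=3):
    """Exhaustive search over integer shifts (dy, dx) in [-max_shift, max_shift].

    Returns the shift that, applied to `moving` with periodic wrap-around
    (an assumption made to keep the sketch short), gives the smallest
    mean square error against `reference`.
    """
    best_shift, best_err = (0, 0), float("inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            candidate = np.roll(moving, shift=(dy, dx), axis=(0, 1))
            err = mean_square_error(reference, candidate)
            if err < best_err:
                best_shift, best_err = (dy, dx), err
    return best_shift, best_err

# Synthetic example: the input image is the reference shifted by (2, -1).
rng = np.random.default_rng(0)
reference = rng.random((16, 16))
moving = np.roll(reference, shift=(2, -1), axis=(0, 1))
shift, err = register_translation(reference, moving)
print(shift, err)
```

Exhaustive search grows quickly with the size of the search space, which is one motivation for the multiresolution strategies listed above.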

Some recent work in image registration
  • An excellent survey: Brown 92.
  • Use of wavelet-based multi-resolution techniques
    (Le Moigne 94)
  • Using evolutionary algorithms as a search
    strategy (Mandava 89)
  • Using the Levenberg-Marquardt optimization
    strategy (Thevenaz 98).

The data may need to be de-noised to better
identify the objects
  • Noise in the data can be due to the data
    acquisition process or natural phenomena such as
    atmospheric turbulence
  • De-noising is difficult as we cannot always tell
    what is the signal and what is the noise
  • A simple approach: thresholding
  • drop all values below a threshold
  • how do we calculate the threshold?
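
A minimal sketch of thresholding follows. The slides leave open how the threshold is calculated. The rule used here (a multiple `k` of a robust noise estimate based on the median absolute deviation) is only one possible assumption, and the sketch compares magnitudes so that negative values are handled. All names are hypothetical.

```python
import numpy as np

def threshold_denoise(values, threshold):
    """Set every value whose magnitude is below `threshold` to zero."""
    values = np.asarray(values, dtype=float)
    return np.where(np.abs(values) < threshold, 0.0, values)

def k_sigma_threshold(values, k=3.0):
    """Assumed rule: k times a robust noise estimate (MAD / 0.6745)."""
    values = np.asarray(values, dtype=float)
    mad = np.median(np.abs(values - np.median(values)))
    return k * mad / 0.6745

# Two strong signal values surrounded by small noise values.
data = np.array([0.1, -0.2, 5.0, 0.05, -0.1, 4.0, 0.15, 0.0])
t = k_sigma_threshold(data)
cleaned = threshold_denoise(data, t)
print(t, cleaned)
```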

Simple filters can be used to smooth the data
and minimize the noise effects
  • Convolve the image with a filter

[Figure: non-zero locations of a 3 by 3 filter]
[Equation: convolution of filter f with image I]
Examples of some simple filters
  • Filters can vary in width - a wide filter gives
    better noise reduction, but smooths the edges
  • Mean filters
  • Gaussian filters
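
Smoothing by convolution can be sketched as follows, assuming the usual discrete definition (f * I)(i, j) = sum over (m, n) of f(m, n) I(i - m, j - n) and zero padding at the image border. The 3 by 3 mean kernel and the binomial approximation to a Gaussian kernel are common textbook choices, not necessarily the filters shown on the slides.

```python
import numpy as np

def convolve2d(image, kernel):
    """Direct 2-D convolution with zero padding; output has the image's shape."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out

# Two simple smoothing filters; both sum to 1 so the mean level is preserved.
mean_3x3 = np.full((3, 3), 1.0 / 9.0)
gauss_3x3 = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0

image = np.zeros((5, 5))
image[2, 2] = 9.0  # a single noisy spike
smoothed = convolve2d(image, mean_3x3)
print(smoothed)
```

The spike is spread evenly over its 3 by 3 neighbourhood, which shows both the noise reduction and the blurring that a wider filter would make stronger.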

Multi-resolution analysis using wavelets
  • Using appropriate filters, decompose a signal into
  • high frequency part (detail coefficients)
  • low frequency part (smooth coefficients)
  • In 2-dimensions, apply first along one dimension
    and then the other
  • Choice of wavelets, transforms, boundary
    conditions, number of levels

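Example (pairwise averages | pairwise half-differences):
  signal    64 48 16 32 56 56 48 24
  level 1   56 24 56 36 |  8 -8  0 12
  level 2   40 46 | 16 10 |  8 -8  0 12
  level 3   43 | -3 | 16 10 |  8 -8  0 12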
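
The numbers above are consistent with repeated pairwise averaging and differencing of the signal 64 48 16 32 56 56 48 24, that is, an unnormalized Haar decomposition. A minimal sketch that reproduces them, with hypothetical function names:

```python
def haar_step(values):
    """One level: pairwise averages (smooth) and pairwise half-differences (detail)."""
    averages = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    details = [(values[i] - values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return averages, details

def haar_decompose(signal):
    """Full decomposition of a signal whose length is a power of two.

    Returns the final smooth coefficient followed by the detail
    coefficients, ordered from the coarsest level to the finest.
    """
    smooth = list(signal)
    details = []
    while len(smooth) > 1:
        smooth, d = haar_step(smooth)
        details = d + details  # coarser details go in front
    return smooth + details

coeffs = haar_decompose([64, 48, 16, 32, 56, 56, 48, 24])
print(coeffs)
```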
[Figure: wavelet multi-resolution analysis of an image, Haar wavelet,
periodic boundary conditions, 2 level decomposition;
panel labels: Vertical, Diagonal]
Wavelets can be used for removing noise from data
  • Useful when data is available compressed using
    wavelet transforms
  • Basic idea - drop detail coefficients below a
    threshold
  • Extensive study (Fodor/Kamath 01)
  • several wavelets, boundary treatments
  • several shrinkage rules, shrinkage functions
  • compare with linear and non-linear spatial
    filters

Noisy Image -> Forward wavelet transform ->
Calculate threshold -> Apply threshold ->
Inverse wavelet transform -> Denoised Image
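
The pipeline (forward transform, calculate threshold, apply threshold, inverse transform) can be sketched in one dimension. This is not the configuration used in the study: it assumes an orthonormal Haar wavelet, soft thresholding, and the universal threshold with a median-based noise estimate, chosen only because they fit in a few lines. All function names are hypothetical.

```python
import numpy as np

def haar_forward(signal, levels):
    """Orthonormal Haar transform; returns (smooth, [details, coarse to fine])."""
    smooth = np.asarray(signal, dtype=float)
    details = []
    for _ in range(levels):
        even, odd = smooth[0::2], smooth[1::2]
        details.insert(0, (even - odd) / np.sqrt(2.0))
        smooth = (even + odd) / np.sqrt(2.0)
    return smooth, details

def haar_inverse(smooth, details):
    """Invert haar_forward, rebuilding the signal from coarse to fine."""
    smooth = np.asarray(smooth, dtype=float)
    for d in details:
        out = np.empty(2 * smooth.size)
        out[0::2] = (smooth + d) / np.sqrt(2.0)
        out[1::2] = (smooth - d) / np.sqrt(2.0)
        smooth = out
    return smooth

def soft_threshold(coeffs, t):
    """Shrink coefficients toward zero by t; magnitudes below t become 0."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)

def wavelet_denoise(noisy, levels=4):
    smooth, details = haar_forward(noisy, levels)
    # Assumed rule: universal threshold, noise level estimated from the
    # finest-scale detail coefficients (median absolute value / 0.6745).
    sigma = np.median(np.abs(details[-1])) / 0.6745
    t = sigma * np.sqrt(2.0 * np.log(noisy.size))
    details = [soft_threshold(d, t) for d in details]
    return haar_inverse(smooth, details)

# Piecewise-constant test signal with additive Gaussian noise.
rng = np.random.default_rng(1)
clean = np.repeat([0.0, 4.0, 1.0, 3.0], 64)
noisy = clean + rng.normal(0.0, 0.5, clean.size)
denoised = wavelet_denoise(noisy)
mse_noisy = float(np.mean((noisy - clean) ** 2))
mse_denoised = float(np.mean((denoised - clean) ** 2))
print(mse_noisy, mse_denoised)
```

A piecewise-constant signal is a favourable case for the Haar wavelet; other signals, wavelets, and shrinkage rules can behave quite differently, which is the point of a comparative study.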
Comparison of denoising: wavelets (symmlet12) vs.
spatial filters
  • Noisy, 20: MSE = 398
  • MMSE Gaussian: MSE = 65
Results of study on wavelet-based statistical
techniques for denoising data
  • Results independent of choice of wavelets
  • Soft thresholding better than hard or semi-soft
  • SURE and Bayes rules are consistently better
  • Wavelets preserve edges better introduce
  • Wavelets are not good at structured noise
  • Combination of spatial filters may give smaller
  • Spatial filters often blur edges
  • Other approaches - diffusion-based methods, level
    set methods, ENO and TVD schemes, non-decimated
    transforms, curvelets, ...

De-noising techniques applied to the FIRST images
[Figure panels: unsharp mask, simple threshold]
Wavelets can be a useful tool in several aspects
of data mining (Fodor/Kamath 00)
  • Very effective in compression
  • astronomy, simulations, FBI fingerprints
  • JPEG 2000, MPEG-4 standards
  • progressive transmission of data
  • Mining compressed data visualization approaches
    (Machiraju 01)
  • Feature extraction (at different scales)
  • texture analysis (Ma/Manjunath 95)
  • Image registration (Le Moigne 94)
  • Caveat: recent work (Candes/Donoho 00) indicates
    that wavelets might not be good for > 1D data

Once the data has been de-noised, we need to
identify the objects in it
  • Identifying the objects is non-trivial
  • tremendous variability of object shapes man-made
    vs. natural objects
  • denoising may have smoothed the edges
  • variations in image quality (noise, boundary

Identifying objects in data is difficult, both in
2 and 3-D images and meshes
  • Challenges in traditional image algorithms
  • need many parameters for optimal performance
  • interactions between parameters are complex and
  • no universally accepted measure of quality of the
    segmented image
  • no single method can handle variations between
  • Identifying objects in mesh data
  • mesh may move/change over time
  • in two/three spatial dimensions + time
  • irregular meshes
  • objects may split or merge

Several techniques are being used in the image
processing community
  • Histogram the image, and threshold it based on
    the histogram: separate the foreground from the
    background
  • Segmentation techniques
  • split and merge (top-down)
  • region growing (bottom-up)
  • Edge detection: use a filter to identify an edge
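
As an illustration of the bottom-up idea, region growing can be sketched as a breadth-first flood fill from a seed pixel. The homogeneity test (pixel value within a fixed `tolerance` of the seed value) and 4-connectivity are assumptions made for this sketch. Practical implementations use many variants of both.

```python
from collections import deque
import numpy as np

def region_grow(image, seed, tolerance):
    """Grow a 4-connected region from `seed`, adding pixels whose value is
    within `tolerance` of the seed value. Returns a boolean mask."""
    image = np.asarray(image, dtype=float)
    mask = np.zeros(image.shape, dtype=bool)
    seed_value = image[seed]
    queue = deque([seed])
    mask[seed] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < image.shape[0] and 0 <= nc < image.shape[1]
            if inside and not mask[nr, nc] \
                    and abs(image[nr, nc] - seed_value) <= tolerance:
                mask[nr, nc] = True
                queue.append((nr, nc))
    return mask

# Small test image: a dark object (values near 10) in the top-left corner.
image = np.array([
    [10, 10, 10, 80, 80],
    [10, 12, 11, 80, 82],
    [10, 11, 10, 81, 80],
    [50, 50, 50, 50, 50],
])
mask = region_grow(image, seed=(0, 0), tolerance=5)
print(mask.astype(int))
```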

Examples of some simple edge detection
[Figure panels: Original, Sobel]
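
A minimal sketch of Sobel edge detection, using the standard 3 by 3 Sobel kernels and the gradient magnitude as the edge strength. Leaving the border pixels at zero is a simplification chosen for brevity, and the function names are hypothetical.

```python
import numpy as np

# Standard Sobel kernels for horizontal and vertical intensity changes.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(image, kernel):
    """Apply a 3x3 kernel by correlation over the interior; border stays 0."""
    image = np.asarray(image, dtype=float)
    out = np.zeros_like(image)
    for i in range(1, image.shape[0] - 1):
        for j in range(1, image.shape[1] - 1):
            out[i, j] = np.sum(image[i - 1:i + 2, j - 1:j + 2] * kernel)
    return out

def sobel_magnitude(image):
    """Edge strength: magnitude of the Sobel gradient at each pixel."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return np.hypot(gx, gy)

# Test image with a vertical step edge between columns 2 and 3.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
edges = sobel_magnitude(image)
print(edges)
```

The response is non-zero only in the two columns next to the step, and zero in the flat regions on either side.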
More sophisticated techniques for object
  • Combine traditional techniques with evolutionary
    algorithms to make them more adaptive (Bhanu/Lee
    95, Cagnoni 97)
  • Deformable models for segmentation
  • parametric approach: snakes or active contours
    (Kass et al. 87)
  • geometric approach: level set methods (Malladi
    and Sethian 96)
  • Non-linear diffusion filters based on PDEs
  • smooth images while enhancing edges (Weickert et
    al. 98)

PDE-based techniques are gaining popularity -
they are robust, but expensive.
Once the objects have been identified, the
features must be extracted
  • Features dependent on the problem
  • identifying relevant features
  • extracting robust features
  • extracting features invariant to scale, rotation,
    and translation
  • Features may include
  • distances, angles, areas
  • histograms
  • Fourier or wavelet coefficients
  • various moments
  • ...
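
As one illustration of moment-based features, the sketch below computes the area of a binary object and two second-order central moments normalized by the squared area. Central moments are unchanged by translation, and the normalization removes the dependence on scale in the continuous case. The choice of these particular features, and the function names, are assumptions for this example only.

```python
import numpy as np

def central_moment(mask, p, q):
    """Central moment mu_pq of a binary object mask (rows = y, cols = x)."""
    rows, cols = np.nonzero(mask)
    r0, c0 = rows.mean(), cols.mean()  # object centroid
    return float(np.sum((rows - r0) ** p * (cols - c0) ** q))

def shape_features(mask):
    """Area plus two second-order moments normalized by area squared."""
    area = float(np.count_nonzero(mask))
    eta20 = central_moment(mask, 2, 0) / area ** 2
    eta02 = central_moment(mask, 0, 2) / area ** 2
    return {"area": area, "eta20": eta20, "eta02": eta02}

obj = np.zeros((20, 20), dtype=bool)
obj[2:6, 3:11] = True                              # a 4 x 8 rectangle
shifted = np.roll(obj, shift=(7, 5), axis=(0, 1))  # same shape, translated
f1, f2 = shape_features(obj), shape_features(shifted)
print(f1)
print(f2)
```

Both calls give the same feature values, which is the translation invariance asked for above. Rotation invariance needs further combinations of moments.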

May need to reduce the dimension or the number
of features
[Diagram: Raw Data -> Object recognition and Feature Extraction
-> (data items x features) -> Dimension Reduction
-> (data items x features) -> Pattern Recognition]
There are several reasons why dimension reduction
may be helpful
  • Fewer features may make pattern recognition
    algorithms computationally tractable
  • Less time is spent in extracting features
  • Can minimize correlations between features, which
    may be a requirement of some algorithms (e.g.

In the FIRST data, we need to reduce the 103
features for 3-entry sources
  • Input from domain experts
  • EDA techniques: parallel plots and box plots
  • Wrapper approach

There are also more complex techniques for
dimension reduction
  • Principal component analysis
  • transform the features to be mutually
  • focus on directions that maximize the variance

Principal component analysis algorithm
  • N data items in d dimensions
  • find the d-dimensional mean vector
  • obtain the d x d covariance matrix
  • obtain the d eigenvalues and eigenvectors of the
    covariance matrix
  • keep the eigenvectors with the k largest
    eigenvalues (k << d)
  • project the (original data - mean) into the space
    spanned by these vectors

The eigenvectors or principal components (PCs)
are mutually orthogonal and the original
data is a linear combination of these PCs
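
The steps of the algorithm map directly onto a few NumPy calls. The sketch below assumes the sample covariance matrix (`np.cov`) and a symmetric eigensolver (`np.linalg.eigh`). The function name `pca` and the synthetic data are illustrative only.

```python
import numpy as np

def pca(data, k):
    """Project N x d data onto the k eigenvectors of the covariance matrix
    that have the largest eigenvalues."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)                  # d-dimensional mean vector
    centered = data - mean
    cov = np.cov(centered, rowvar=False)      # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # re-order: largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :k]               # d x k, orthonormal columns
    projected = centered @ components         # N x k
    return projected, components, eigvals

# Synthetic data: 5 features driven by 2 underlying factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
data = latent @ mixing + 0.01 * rng.normal(size=(200, 5))
projected, components, eigvals = pca(data, k=2)
explained = eigvals[:2].sum() / eigvals.sum()
print(projected.shape, explained)
```

The ratio `explained` is the fraction of the total variance captured by the first k components, the quantity usually inspected when choosing k.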
We applied PCA to the problem of bent-double
  • The first 20 PCs explained about 90% of the
  • Eliminate unimportant variables
  • eliminate variable with largest coefficient in
    e-vector corresponding to smallest e-value
  • repeat with the e-vector for the next smallest
    e-value
  • continue till left with 20 variables
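
The elimination procedure can be sketched as follows. Two details are assumptions, since the slides do not state them: the eigen-decomposition is computed once from the covariance matrix (rather than recomputed after each removal, or taken from the correlation matrix), and a variable that has already been discarded is skipped when looking for the largest coefficient. The function name is hypothetical.

```python
import numpy as np

def eliminate_variables(data, n_keep):
    """Discard variables one at a time, walking through the eigenvectors of
    the covariance matrix from the smallest eigenvalue upwards."""
    data = np.asarray(data, dtype=float)
    d = data.shape[1]
    cov = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending: smallest first
    kept = list(range(d))
    dropped = []
    for idx in range(d - n_keep):
        coeffs = np.abs(eigvecs[:, idx])
        # variable with the largest coefficient among those still kept
        victim = max(kept, key=lambda v: coeffs[v])
        kept.remove(victim)
        dropped.append(victim)
    return kept, dropped

# Synthetic data: column 2 is almost a copy of column 0, so one of the
# two is redundant and should be the first to go.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.01 * rng.normal(size=200)
data = np.column_stack([a, b, c])
kept, dropped = eliminate_variables(data, n_keep=2)
print(kept, dropped)
```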

Using the 31 features found through EDA and
PCA lowers the error from 11.1% to 9.5%
Need more appropriate techniques for dimension
  • PCA may not always be appropriate
  • linear
  • orthogonal
  • Other options
  • independent component analysis
  • blind source separation
  • non-linear PCA
  • genetic algorithms
  • Need incremental techniques which are applied as
    the data is being collected (Kargupta 00)

It is difficult to find labeled data in science
and engineering applications
  • Training set usually generated manually, not
  • Not all scientists may agree on a label
  • Labeled data vs. interesting data
  • Often ground truth is unavailable, or difficult
    to find
  • Approach to labeling may be ad-hoc
  • the yellow-sticky-pad approach to identifying
    bent doubles

Non-bent double
Sapphire experiences with a flexible system
design for data mining
  • We address the needs of a diverse set of
  • Not all problems require the entire process
  • Not all algorithms are suitable for a problem
  • Algorithms typically depend on several parameters
  • Intermediate data must be handled properly
  • Domain dependent and independent parts must be
    clearly identified
  • Should be able to accommodate a growing data set

The Sapphire approach: a flexible, portable,
scalable system architecture
  • User input and feedback
  • Components linked by Python
Other pointers that discuss system architecture
  • Data mining specific projects
  • ADAM , JARTool, Diamond Eye
  • Workshops of more general interest
  • mining scientific datasets (httpwww.ahpcrc.umn.ed
  • interfaces to scientific data archives
  • large scientific databases (http//www.cacr.caltec
  • issues in the application of data mining to
    scientific data (http//
  • data fusion and data mining (http//ic-www.arc.nas

Challenges in mining science and engineering data
  • Feature extraction is non-trivial
  • Labeled data is difficult to obtain
  • Data can be high dimensional
  • Need techniques to handle spatial and temporal
  • System infrastructure issues are important
  • Data fusion and registration are required in some
  • Data may be compressed
  • May need to mine data as it is being generated
  • ...

  • The Sapphire project team: Erick Cantú-Paz, Imola
    K. Fodor, and Nu Ai Tang
  • Sisira Weeratunga (LLNL) for insights on
    simulations and PDEs
  • FIRST scientists: Bob Becker, Michael Gregg,
    Sally Laurent-Muehleisen, and Rick White

Chapter 7 - References
  • Credits for images used in Chapter 7 (if not
    provided with the image)
  • MACHO web page http//
  • 3D meshes http//
  • Structured and unstructured mesh around the front
    of an aircraft, http//
  • 3D unstructured mesh with heterogeneous elements
  • Composite grid SAMRAI project -
  • Wavelet images generated by Sapphire software -

  • Burl, M., L. Asker, P. Smyth, U. Fayyad, P.
    Perona, L. Crumpler, and J. Aubele, Learning to
    recognize volcanoes on Venus, Machine Learning,
    Volume 30, pages 165-195, 1998.
  • Langley, P. and H. A. Simon, Applications of
    machine learning and rule induction,
    Communications of the ACM, Volume 38, Number 11,
    pages 55-64, 1995.
  • Brodley, C. and P. Smyth, The process of
    applying machine learning algorithms, Workshop
    on applying machine learning in practice, IMLC
    1995 (http//
  • Kamath, C., E. Cantú-Paz, I. K. Fodor, and N.
    Tang, Searching for bent-double galaxies in the
    FIRST survey, in Data Mining for Scientific and
    Engineering Applications, R. Grossman, C. Kamath,
    W. Kegelmeyer, V. Kumar, and R. Namburu (eds.),
    Kluwer 2001.

  • Brown, L. A Survey of Image Registration
    Techniques. ACM Computing Surveys, Vol. 24,
    Number 4, December 1992.
  • Le Moigne, J., Parallel Registration of
    Multi-sensor remotely sensed imagery using
    wavelet coefficients, Proc. SPIE Wavelet
    Applications Conference, Orlando, 1994, pages
  • Mandava, V., Fitzpatrick, J., and Pickens, D.
    (1989). Adaptive search space scaling in digital
    image registration. IEEE Transactions on Medical
    Imaging, 8, 251-262.
  • Thevenaz, P., Ruttimann, U., Unser, M., A
    Pyramid Approach to Sub-pixel Registration based
    on intensity, IEEE Transactions on Image
    Processing, Vol 7, Number 1, January 1998.
  • Fodor, I.K. and C. Kamath, On denoising images
    using wavelet-based statistical techniques,
    submitted for publication. See the Sapphire web
    page for details.

  • Fodor, I.K. and C. Kamath, The role of
    multi-resolution in mining massive image
    datasets, Proceedings of the YES2000 Symposium
    on Advanced Multiscale and Multi-resolution
    Methods, Lecture Notes in Computational Science
    and Engineering, Springer-Verlag, 2001.
  • Machiraju, R. and J. Fowler, D. Thompson, W.
    Schroeder, and B. Soni, EVITA - A Prototype
    System for Efficient Visualization and
    Interrogation of Terascale Datasets, to appear
    in Data Mining for Scientific and Engineering
    Applications, R. Grossman, C. Kamath, W.
    Kegelmeyer, V. Kumar, and R. Namburu (eds.),
    Kluwer 2001.
  • Ma, W. Y. and B. S. Manjunath, A comparison of
    wavelet transform features for texture image
    annotation, Proc. Second International
    Conference on Image Processing, ICIP 95, pages

  • Candes, E. and D. Donoho, Curvelets,
    multiresolution representation, and scaling
    laws, Proc. Wavelet Applications in Signal and
    Image Processing VIII, SPIE 2000, vol. 4119.
  • Bhanu, B. and S. Lee, Adaptive image segmentation
    using a genetic algorithm, IEEE Transactions on
    Systems, Man, and Cybernetics, 25, pages
    1543-1567, 1995.
  • Cagnoni, S., Dobrzeniecki, A, R. Poli, J. Yanch,
    Segmentation of 3D medical images through
    genetically optimized contour tracking
    algorithms, U. Birmingham School of Computer
    Science Technical Report, CSRP-97-28, 1997.
  • Kass, M., A. Witkin, and D. Terzopoulos, Snakes:
    active contour models, Intl J. Computer Vision,
    Volume 1, No. 4, pages 321-331, 1987.

  • Malladi, R. and J. Sethian, A unified approach
    to noise removal, image enhancement, and shape
    recovery, IEEE Transactions on Image Processing,
    Volume 5, 1996, pages 1154-1168.
  • Weickert, J, B. ter Haar Romeny, and M.
    Viergever, Efficient and Reliable Schemes for
    Nonlinear Diffusion Filtering, IEEE Transactions
    on Image Processing, Volume 7, Number 3, March
  • Jolliffe, I., Principal Component Analysis,
    Springer Verlag, 1986.
  • Kargupta, H, W. Huang, S. Krishnamoorthy, and E.
    Johnson, Distributed clustering using collective
    principal component analysis, ACM SigKDD
    Workshop on Distributed and Parallel Knowledge
    Discovery, 2000.