Very large data sets

1
Very large data sets
Clustering methods Part 10
Pasi Fränti
5.5.2014
  • Speech and Image Processing Unit, School of
    Computing
  • University of Eastern Finland

2
Methods for large data sets
  • Birch
  • Clarans
  • On-line EM
  • Scalable EM
  • GMG

3
Gradual model generator (GMG)
Kärkkäinen and Fränti, Pattern Recognition, 2007
4
Goal of the GMG algorithm
[Figure panels: EM, GMG]
5
Contours of probability density distributions
[Figure panels: EM, GMG]
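For reference, if both models compared here are Gaussian mixtures (an assumption based on the EM mixture-model reference in the literature list, not something the slide states), the density whose contours are drawn has the standard form below. The symbols K, w_k, mu_k and Sigma_k are notation introduced for this note only.

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} w_k \,
  \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1
```

A contour is then the set of points x where p(x) equals a fixed level.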
6
Model update
  • Each new data point is mapped to a model
    immediately on input.
  • Points that are too far from every model remain
    in a buffer.
  • Buffered points are re-tested whenever new
    models are created.

[Figure panels: before update, after update]
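The update rule on this slide can be sketched roughly as follows. This is a minimal illustration, not the published GMG procedure: the names `Component`, `Model`, `max_dist`, `add_point` and `add_component` are hypothetical, "too far" is assumed to mean a Euclidean distance above a fixed threshold, and each component is reduced to a running mean and a point count.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Component:
    mean: list
    count: int = 1

    def absorb(self, point):
        """Fold one point into the running mean."""
        self.count += 1
        self.mean = [m + (x - m) / self.count for m, x in zip(self.mean, point)]


@dataclass
class Model:
    max_dist: float  # hypothetical threshold for "too far from any model"
    components: list = field(default_factory=list)
    buffer: list = field(default_factory=list)

    def add_point(self, point):
        """Map the point to its nearest component now, or park it in the buffer."""
        if self.components:
            nearest = min(self.components, key=lambda c: math.dist(c.mean, point))
            if math.dist(nearest.mean, point) <= self.max_dist:
                nearest.absorb(point)
                return True
        self.buffer.append(list(point))
        return False

    def add_component(self, seed_points):
        """Create a component from seed points, then re-test the buffered points."""
        n, dim = len(seed_points), len(seed_points[0])
        mean = [sum(p[d] for p in seed_points) / n for d in range(dim)]
        self.components.append(Component(mean, count=n))
        # Seeds leave the buffer; every other buffered point is tried again.
        pending = [p for p in self.buffer if p not in seed_points]
        self.buffer = []
        for p in pending:
            self.add_point(p)  # points still too far return to the buffer
```

With `max_dist=1.0`, points added before any component exists all land in the buffer; calling `add_component` afterwards pulls the nearby ones out and leaves the distant ones buffered.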
7
Generating new components
  • When the buffer is full, selected points are
    used to generate new components.
  • The most compact k-neighborhood is selected as
    the seed for a new component.

[Figure panels: selected points and a new component, data in buffer]
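One way to read "most compact k-neighborhood" is sketched below. The slide does not define compactness, so this example assumes it means the smallest sum of Euclidean distances from a buffered point to its k nearest buffered neighbors. The function names `most_compact_neighborhood` and `seed_component` are hypothetical, and the brute-force search is quadratic in the buffer size.

```python
import math


def most_compact_neighborhood(buffer, k):
    """Return a point and its k nearest buffered neighbors, chosen so that
    the total distance to those neighbors is minimal (assumed measure)."""
    if len(buffer) <= k:
        raise ValueError("buffer must hold more than k points")
    best_cost, best_group = math.inf, None
    for i, center in enumerate(buffer):
        # Distances from this candidate center to every other buffered point.
        others = sorted(
            (math.dist(center, p), j) for j, p in enumerate(buffer) if j != i
        )
        nearest = others[:k]
        cost = sum(d for d, _ in nearest)
        if cost < best_cost:
            best_cost = cost
            best_group = [i] + [j for _, j in nearest]
    return [buffer[j] for j in best_group]


def seed_component(points):
    """Mean and per-dimension variance of the selected seed points."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    var = [sum((p[d] - mean[d]) ** 2 for p in points) / n for d in range(dim)]
    return mean, var
```

The selected points would then be removed from the buffer and the remaining buffered points re-tested against the new component, as described on the previous slide.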
8-13
Example (six slides, figures only)
14
Post-processing
[Figure: model before processing]
15
Post-processing
[Figure panels: model before processing, updated model]
16
Post-processing
[Figure panels: model before processing, updated model data]
17
Literature
  1. I. Kärkkäinen and P. Fränti, "Gradual model
    generator for single-pass clustering", Pattern
    Recognition, 40(3), 784-795, March 2007.
  2. P. Bradley, U. Fayyad and C. Reina, "Clustering
    Very Large Databases Using EM Mixture Models",
    Proc. of the 15th Int. Conf. on Pattern
    Recognition, vol. 2, 76-80, 2000.
  3. R. Ng and J. Han, "CLARANS: A Method for
    Clustering Objects for Spatial Data Mining", IEEE
    Trans. Knowledge and Data Engineering, 14(5),
    1003-1016, 2002.
  4. M. Sato and S. Ishii, "On-line EM Algorithm for
    the Normalized Gaussian Network", Neural
    Computation, 12(2), 407-432, 2000.
  5. T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: A
    New Data Clustering Algorithm and Its
    Applications", Data Mining and Knowledge
    Discovery, 1(2), 141-182, 1997.