Instance Construction via Likelihood-Based Data Squashing

1
Instance Construction via Likelihood-Based Data Squashing
  • Madigan, D., et al.
  • (Ch. 12, Instance Selection and
    Construction for Data Mining (2001), Kluwer
    Academic Publishers)
  • Summarized by Jinsan Yang, SNU
    Biointelligence Lab

2
  • Abstract
  • Data compression method: squashing
  • LDS: likelihood-based data squashing
  • Keywords
  • Instance construction, data squashing

3
Outline
  • Introduction
  • The LDS Algorithm
  • Evaluation: Logistic Regression
  • Evaluation: Neural Networks
  • Iterative LDS
  • Discussion

4
Introduction
  • Massive data examples
  • Large-scale retailing
  • Telecommunications
  • Astronomy
  • Computational biology
  • Internet logging
  • Some computational challenges
  • Multiple passes over the data are needed
  • Disk access is 10^5 to 10^6 times slower than main memory
  • Current solution: scaling up existing algorithms
  • Here: scaling down the data
  • Data squashing: 750,000 → 8,443 (DuMouchel et al.,
    1999)
  • Outperforms a random sample of size 7,543 by a
    factor of 500 in MSE

5
LDS Algorithm
  • Motivation: Bayes' rule
  • Given three data points d1, d2, d3, estimate the
    parameter θ
  • Cluster the data points by their likelihood profiles
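
The motivation can be written out as a short derivation. This is a sketch under stated assumptions: the data points are taken to be conditionally independent given θ, and the choice of d1 and d2 as the pair with similar likelihood profiles is hypothetical, as is the pseudo data point d*.

```latex
% Posterior for three conditionally independent data points
p(\theta \mid d_1, d_2, d_3) \;\propto\; p(\theta)\, p(d_1 \mid \theta)\, p(d_2 \mid \theta)\, p(d_3 \mid \theta)

% Assumption: d_1 and d_2 have nearly the same likelihood profile,
% p(d_1 \mid \theta) \approx p(d_2 \mid \theta) \approx p(d^{*} \mid \theta)
% over the relevant range of \theta. Then one pseudo data point d^{*}
% with weight 2 can stand in for both:
p(\theta \mid d_1, d_2, d_3) \;\approx\; c \cdot p(\theta)\, p(d^{*} \mid \theta)^{2}\, p(d_3 \mid \theta)
```

Points whose likelihoods move together as θ varies carry nearly the same information about θ, which is why clustering is done on likelihood profiles rather than on the raw data values.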

6
LDS Algorithm
  • Details of the LDS algorithm
  • Select: choose values of the parameter θ by a
    central composite design
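
A central composite design for k factors is commonly built from 2^k corner points, 2k axial points, and one center point, giving 15 points for 3 factors. The sketch below generates such a design in coded units and scales it around an initial estimate. The function name, the rotatable choice of `alpha`, and the values of `theta_hat` and `std_err` are illustrative assumptions, not taken from the slides.

```python
import itertools

def central_composite_design(k, alpha=None):
    """Coded design points: 2^k corners, 2k axial points, 1 center point."""
    if alpha is None:
        # Assumption: rotatable design, a common default for the axial distance
        alpha = (2 ** k) ** 0.25
    corners = [list(p) for p in itertools.product([-1.0, 1.0], repeat=k)]
    axial = []
    for i in range(k):
        for a in (alpha, -alpha):
            point = [0.0] * k
            point[i] = a
            axial.append(point)
    center = [[0.0] * k]
    return corners + axial + center

design = central_composite_design(3)

# Hypothetical initial estimate and standard errors, for illustration only
theta_hat = [0.5, -1.0, 2.0]
std_err = [0.1, 0.2, 0.3]

# Each design row becomes one candidate value of theta around theta_hat
theta_grid = [
    [t + s * c for t, s, c in zip(theta_hat, std_err, row)]
    for row in design
]
print(len(theta_grid))
```

The last row of `theta_grid` is the center point, which is `theta_hat` itself; the remaining rows perturb it along and across the coordinate axes.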

Figure: central composite design for 3 factors

7
LDS Algorithm
  • Profile: evaluate the likelihood profile of each
    data point
  • Cluster: cluster the mother data in a single
    pass
  • Select n random samples as initial cluster
    centers
  • Assign each remaining data point to a cluster
  • Construct: construct the pseudo data
  • Use the cluster center as the pseudo data point
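
The profile, cluster, and construct steps can be sketched for logistic regression as follows. This is a minimal illustration, not the authors' implementation: the function names, the squared Euclidean distance between profiles, and the use of the cluster mean (with the cluster size as weight) as the pseudo data point are all assumptions, as are the synthetic data and the candidate `thetas`.

```python
import math
import random

def log_lik(x, y, theta):
    """Log-likelihood of one observation (x, y) under logistic regression."""
    z = sum(t * xi for t, xi in zip(theta, x))
    # y*z - log(1 + exp(z)), written in a numerically stable form
    return y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))

def profile(x, y, thetas):
    """Likelihood profile: log-likelihood at each candidate theta."""
    return [log_lik(x, y, th) for th in thetas]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lds_squash(data, thetas, n_clusters, seed=0):
    """Single-pass clustering on profiles; returns (x_mean, y_mean, weight)."""
    rng = random.Random(seed)
    # Random samples serve as the initial (and fixed) cluster centers
    center_idx = rng.sample(range(len(data)), n_clusters)
    centers = [profile(data[i][0], data[i][1], thetas) for i in center_idx]
    members = [[data[i]] for i in center_idx]
    seeded = set(center_idx)
    for i, (x, y) in enumerate(data):
        if i in seeded:
            continue
        p = profile(x, y, thetas)
        nearest = min(range(n_clusters), key=lambda c: sq_dist(p, centers[c]))
        members[nearest].append((x, y))
    pseudo = []
    for group in members:
        w = len(group)
        # Assumption: pseudo data point = cluster mean, weight = cluster size
        x_mean = [sum(col) / w for col in zip(*(x for x, _ in group))]
        y_mean = sum(y for _, y in group) / w
        pseudo.append((x_mean, y_mean, w))
    return pseudo

# Synthetic mother data: intercept plus one covariate (illustrative only)
rng = random.Random(1)
data = []
for _ in range(200):
    x = [1.0, rng.gauss(0.0, 1.0)]
    prob = 1.0 / (1.0 + math.exp(-(0.5 + 1.5 * x[1])))
    data.append((x, 1 if rng.random() < prob else 0))

# Hypothetical candidate values of theta around an initial estimate
thetas = [[0.5, 1.5], [0.3, 1.5], [0.7, 1.5], [0.5, 1.2], [0.5, 1.8]]
pseudo = lds_squash(data, thetas, n_clusters=20)
print(len(pseudo), sum(w for _, _, w in pseudo))
```

The weights of the pseudo data points sum to the size of the mother data, so a weighted fit on the squashed data is on the same scale as a fit on the full data.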

8
Evaluation Logistic Regression
  • Small-scale simulations
  • Initial estimate of the parameter θ
  • Plot: log(error ratio)
  • Three methods of initial parameter estimation
  • 100 data points / 48 squashed data points

9
Evaluation Logistic Regression
  • Medium scale: 100,000 data points; base 1: simple
    random sampling

10
Evaluation Logistic Regression
  • Large scale: 744,963 data points; base 1: simple
    random sampling

11
Evaluation Neural Networks
  • Feed-forward network: two input nodes, one hidden
    layer with 3 units
  • Single binary output
  • Mother data: 10,000; squashed data: 1,000;
    repetitions: 30
  • Test data: 1,000 points from the same network
  • Comparison of P(whole) - P(reduced)

12
Evaluation Neural Networks

13
Iterative LDS
  • Use when the initial estimate of the parameter θ is
    not accurate.
  • 1. Set θ from a simple random sample
  • 2. Squash by LDS
  • 3. Estimate θ from the squashed data
  • 4. Go to 2.
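
The loop can be sketched generically, with the model fit and the squashing step passed in as functions. Everything here is an assumption for illustration: the function names, the fixed iteration count `n_iter` (the slide gives no stopping rule), and the toy model, a normal mean with unit variance. For that toy model the weighted mean of the pseudo data equals the full-data mean whatever the clustering, so the example shows the control flow only.

```python
import random

def iterative_lds(data, fit, squash, sample_size, n_iter=3, seed=0):
    """Alternate squashing and re-estimation, starting from a random sample."""
    rng = random.Random(seed)
    # Step 1: initial estimate from a simple random sample (unit weights)
    sample = rng.sample(data, sample_size)
    theta = fit([(x, 1.0) for x in sample])
    for _ in range(n_iter):
        pseudo = squash(data, theta)  # Step 2: squash around current theta
        theta = fit(pseudo)           # Step 3: re-estimate from pseudo data
    return theta                      # Step 4 is the loop itself

def fit_mean(weighted):
    """Weighted maximum-likelihood estimate of a normal mean."""
    total = sum(w for _, w in weighted)
    return sum(x * w for x, w in weighted) / total

def squash_mean(data, theta, n_clusters=10, delta=0.5, seed=0):
    """Toy squashing: cluster on log-likelihood profiles near theta."""
    thetas = [theta - delta, theta, theta + delta]

    def prof(x):
        return [-0.5 * (x - t) ** 2 for t in thetas]

    rng = random.Random(seed)
    centers = [prof(x) for x in rng.sample(data, n_clusters)]
    sums = [0.0] * n_clusters
    counts = [0] * n_clusters
    for x in data:
        p = prof(x)
        c = min(
            range(n_clusters),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
        )
        sums[c] += x
        counts[c] += 1
    # Pseudo data point = cluster mean, weight = cluster size
    return [(s / n, float(n)) for s, n in zip(sums, counts) if n > 0]

rng = random.Random(2)
data = [rng.gauss(3.0, 1.0) for _ in range(1000)]
theta = iterative_lds(data, fit_mean, squash_mean, sample_size=20)
print(theta)
```

For models such as logistic regression the squashed estimate does depend on the θ values used for profiling, which is where repeating steps 2 and 3 is expected to help.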