Efficient and Effective Itemset Pattern Summarization: Regression-based Approaches

Learn more at: https://www.cs.kent.edu

1
Efficient and Effective Itemset Pattern
Summarization: Regression-based Approaches
  • Ruoming Jin
  • Kent State University
  • Joint work with Muad Abu-Ata, Yang Xiang, and
    Ning Ruan (KSU)

2
Problem Definition
  • Given a large collection of frequent itemsets and
    their supports, how can we concisely represent
    them?
  • Coverage criterion
  • The Spanning Set Approach: F. Afrati, A. Gionis,
    H. Mannila, Approximating a collection of frequent
    sets, KDD'04.
  • Frequency criterion
  • The Profile-based Approach: X. Yan, H. Cheng, J.
    Han, and D. Xin, Summarizing itemset patterns: a
    profile-based approach, KDD'05.
  • The Markov Random Field Approach: C. Wang and S.
    Parthasarathy, Summarizing itemset patterns using
    probabilistic models, KDD'06.

3
Frequency Criterion
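  • The restoration function of a set of itemsets S
    is a function that estimates the frequency of
    each itemset in S
  • The restoration error measures how far these
    estimates are from the true frequencies
  • We use the 2-norm in this study.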
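
One way to write the frequency criterion down, as a sketch only: the symbols f (true frequency), \hat{f}_S (restoration function), and E(S) (restoration error) are notational assumptions and are not taken from the slides.

```latex
\hat{f}_S : S \to [0, 1], \qquad
E(S) = \Bigl( \sum_{\alpha \in S} \bigl( f(\alpha) - \hat{f}_S(\alpha) \bigr)^2 \Bigr)^{1/2}
```

Minimizing this 2-norm is equivalent to minimizing the sum of squared differences, which is the form a least-squares regression works with.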

4
Probabilistic Restoration Function
  • Applying the independence probabilistic model for
    a set of itemsets S
  • An example,
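
A minimal sketch of a restoration function under the independence model, assuming it takes the form p(S) multiplied by the product of p(a) over the items a in the itemset (suggested by the parameters p(S), p(a), p(c), p(d) named on the next slide). The function name `restore` and the numeric values are hypothetical and chosen only for illustration.

```python
from math import prod


def restore(itemset, p_S, p_item):
    """Estimate an itemset's frequency as p(S) * product of p(a) for a in itemset.

    Assumed form of the independence model; not confirmed by the slides.
    """
    return p_S * prod(p_item[a] for a in itemset)


# Hypothetical parameter values, for illustration only.
p_S = 0.5
p_item = {"a": 0.8, "c": 0.5, "d": 0.25}

estimate = restore({"a", "c", "d"}, p_S, p_item)
print(estimate)  # approximately 0.05 (0.5 * 0.8 * 0.5 * 0.25)
```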

5
Problem 1: Optimal Parameters
What are the optimal parameters,
p(S), p(a), p(c), p(d), that minimize the
restoration error?
6
Non-Linear Regression
  • We introduce the independent variable
  • We have |S| data points, one per itemset in S.
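
A possible way to set up the regression, written as a sketch: the indicator variable x_a and the product form of the model are assumptions inferred from the parameters p(S), p(a), p(c), p(d), not formulas copied from the slides.

```latex
x_a(\alpha) =
\begin{cases}
1 & \text{if } a \in \alpha \\
0 & \text{otherwise}
\end{cases}
\qquad
f(\alpha) \approx p(S) \prod_{a} p(a)^{x_a(\alpha)}
```

Each itemset \alpha in S then contributes one data point: its indicator vector as the input and its observed frequency f(\alpha) as the response. The model is non-linear in the parameters, but taking logarithms gives a form that is linear in \log p(S) and \log p(a):

```latex
\log f(\alpha) \approx \log p(S) + \sum_{a} x_a(\alpha) \, \log p(a)
```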

7
Linear Regression Approximation
Using a Taylor expansion, we show that the
restoration error from linear regression is very
close to the error from non-linear regression!
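
The linear approximation can be sketched with ordinary least squares on log-frequencies. This assumes the model log f = log p(S) + sum over items of x_a * log p(a), where x_a indicates whether item a is in the itemset; the function name `fit_log_linear` is hypothetical. The fit is unconstrained, so on noisy data an estimated "probability" can exceed 1.

```python
import numpy as np


def fit_log_linear(itemsets, freqs, items):
    """Least-squares fit of log f = log p(S) + sum_a x_a * log p(a).

    Returns (p_S, {item: p(item)}). Assumed model form, for illustration.
    """
    # One row per itemset: an intercept column plus one 0/1 indicator per item.
    X = np.array([[1.0] + [1.0 if a in s else 0.0 for a in items]
                  for s in itemsets])
    y = np.log(np.asarray(freqs, dtype=float))
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    p_S = float(np.exp(coef[0]))
    p_item = {a: float(np.exp(c)) for a, c in zip(items, coef[1:])}
    return p_S, p_item


# Frequencies generated exactly from p(S)=0.5, p(a)=0.8, p(b)=0.5.
itemsets = [{"a"}, {"b"}, {"a", "b"}]
freqs = [0.4, 0.25, 0.2]
p_S, p_item = fit_log_linear(itemsets, freqs, ["a", "b"])
print(p_S, p_item)
```

Because the example frequencies follow the model exactly, the fit recovers the generating parameters up to floating-point error.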
8
Problem 2: Optimal Partition
  • To reduce the restoration error, we adopt a
    partitioning strategy
  • Partition the entire collection of frequent
    itemsets into K disjoint subsets, and build the
    restoration function for each subset
  • How to optimally partition a set of itemsets into
    K disjoint subsets so that the total restoration
    error can be minimized?

9
Our Approaches
  • NP-hard problem
  • Two heuristic algorithms
  • K-Regression
  • Tree Regression

10
K-Regression
  • A k-means type clustering procedure
  • Step 1: Randomly partition the set of itemsets S
    into K partitions
  • Step 2 (Regression): Apply regression to find the
    optimal parameters on each partition
  • Step 3 (Re-assignment): For each itemset, assign it
    to the partition which minimizes its restoration
    error based on the optimal parameters discovered
    by Step 2
  • Repeat Steps 2 and 3 until the total restoration
    error no longer decreases or the improvement is
    small
  • Just like k-means, k-regression is guaranteed
    to converge!
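
The procedure above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the log-linear least-squares fit as the regression step and squared error in frequency space as the restoration error, and the names `design` and `k_regression` are hypothetical. Because the fit is done in log space while the error is measured in frequency space, this sketch simply stops once the error stops improving.

```python
import numpy as np


def design(itemsets, items):
    """Intercept column plus one 0/1 indicator column per item."""
    return np.array([[1.0] + [float(a in s) for a in items] for s in itemsets])


def k_regression(itemsets, freqs, items, K, max_iter=50, tol=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    X = design(itemsets, items)
    f = np.asarray(freqs, dtype=float)
    y = np.log(f)
    labels = rng.integers(K, size=len(itemsets))  # random initial partition
    prev = np.inf
    for _ in range(max_iter):
        # Regression step: fit one parameter vector per non-empty partition.
        # An empty partition keeps all-zero coefficients (it predicts 1.0).
        coefs = np.zeros((K, X.shape[1]))
        for k in range(K):
            mask = labels == k
            if mask.any():
                coefs[k] = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
        # Re-assignment step: squared error of every itemset under every model.
        err = (np.exp(X @ coefs.T) - f[:, None]) ** 2  # shape (n, K)
        labels = err.argmin(axis=1)
        total = float(err.min(axis=1).sum())
        if prev - total <= tol:  # no meaningful improvement
            break
        prev = total
    return labels, coefs, total


itemsets = [{"a"}, {"b"}, {"a", "b"}, {"a", "c"}, {"c"}]
freqs = [0.4, 0.25, 0.2, 0.1, 0.3]
labels, coefs, total = k_regression(itemsets, freqs, ["a", "b", "c"], K=2)
print(labels, total)
```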

11
Tree Regression
S = {{a}, {b}, {c}, {d}, {a,b}, {a,c}, {b,c}, {a,d}, {c,d},
{a,b,c}, {a,b,d}, {a,c,d}}
Using Regression to find optimal parameters for
each subset of itemsets
12
Tree Regression Construction
  • A decision-tree-style construction algorithm
  • Question 1: How to find K subsets of itemsets?
  • Question 2: How to find the optimal splitting?
  • Answer for Q1
  • Maintain a queue of the current leaf nodes, and
    always pick the leaf node with the maximal
    average restoration error to split
  • Answer for Q2
  • Maximally reduce the total restoration error
  • Max E(S) - E(S_1) - E(S_2)
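
A rough sketch of this construction, under stated assumptions: candidate splits are taken to be "contains item a" versus "does not contain item a" (the slides do not say how candidate splits are generated), E is the squared restoration error after a log-linear least-squares fit, and the names `sq_error`, `best_split`, and `tree_regression` are hypothetical.

```python
import heapq
import numpy as np


def sq_error(itemsets, freqs, items):
    """Total squared restoration error after a log-linear fit on this subset."""
    X = np.array([[1.0] + [float(a in s) for a in items] for s in itemsets])
    f = np.asarray(freqs, dtype=float)
    coef = np.linalg.lstsq(X, np.log(f), rcond=None)[0]
    return float(((np.exp(X @ coef) - f) ** 2).sum())


def best_split(idx, itemsets, freqs, items):
    """Pick the item-presence split with the largest E(S) - E(S_1) - E(S_2)."""
    def sub(ids):
        return [itemsets[i] for i in ids], [freqs[i] for i in ids]

    e_parent = sq_error(*sub(idx), items)
    best = None
    for a in items:
        left = [i for i in idx if a in itemsets[i]]
        right = [i for i in idx if a not in itemsets[i]]
        if not left or not right:
            continue  # this item does not separate the subset
        gain = e_parent - sq_error(*sub(left), items) - sq_error(*sub(right), items)
        if best is None or gain > best[0]:
            best = (gain, left, right)
    return best


def tree_regression(itemsets, freqs, items, K):
    """Split leaves until there are K of them; returns lists of itemset indices."""
    root = list(range(len(itemsets)))
    # heapq is a min-heap, so store the negated average error to pop the
    # worst leaf first; the counter breaks ties without comparing lists.
    heap = [(-sq_error(itemsets, freqs, items) / len(root), 0, root)]
    done, counter = [], 1
    while heap and len(heap) + len(done) < K:
        _, _, idx = heapq.heappop(heap)
        split = best_split(idx, itemsets, freqs, items)
        if split is None:
            done.append(idx)  # leaf cannot be split any further
            continue
        for part in split[1:]:
            e = sq_error([itemsets[i] for i in part], [freqs[i] for i in part], items)
            heapq.heappush(heap, (-e / len(part), counter, part))
            counter += 1
    return done + [idx for _, _, idx in heap]


itemsets = [{"a"}, {"b"}, {"a", "b"}, {"a", "c"}]
freqs = [0.5, 0.4, 0.3, 0.2]
print(tree_regression(itemsets, freqs, ["a", "b", "c"], K=2))
```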

13
An Interesting Connection
  • Jerome H. Friedman's 1977 paper, A
    tree-structured approach to nonparametric
    multiple regression.
  • Unfortunately, this work seems never to have
    received enough attention. However, it appears to
    have been part of the inspiration for CART
    (regression trees) and MARS (Multivariate
    Adaptive Regression Splines).

14
Experimental Results
15
Chess Restoration Error
16
BMS-POS Restoration Error
17
BMS-POS Running Time
18
Conclusion
  • Using linear regression to identify optimal
    parameters of the probabilistic restoration
    function (based on the independence assumption)
    for a set of itemsets
  • Two heuristic algorithms to partition the set of
    itemsets into K parts
  • K-regression
  • Tree regression

19
Thanks!!