Data Warehousing/Mining Comp 150 DW Chapter 5: Concept Description: Characterization and Comparison
1
Data Warehousing/Mining Comp 150 DW Chapter 5
Concept Description Characterization and
Comparison
  • Instructor: Dan Hebert

2
Chapter 5 Concept Description Characterization
and Comparison
  • What is concept description?
  • Data generalization and summarization-based
    characterization
  • Analytical characterization: analysis of
    attribute relevance
  • Mining class comparisons: discriminating between
    different classes
  • Mining descriptive statistical measures in large
    databases
  • Discussion
  • Summary

3
What is Concept Description?
  • Descriptive vs. predictive data mining
  • Descriptive mining: describes concepts or
    task-relevant data sets in concise, summarative,
    informative, discriminative forms
  • Predictive mining: based on data and analysis,
    constructs models for the database, and predicts
    the trend and properties of unknown data
  • Concept description
  • Characterization: provides a concise and succinct
    summarization of the given collection of data
  • Comparison: provides descriptions comparing two
    or more collections of data

4
Concept Description vs. OLAP
  • Concept description
  • can handle complex data types of the attributes
    and their aggregations
  • a more automated process
  • OLAP
  • restricted to a small number of dimension and
    measure types
  • user-controlled process

5
Data Generalization and Summarization-based
Characterization
  • Data generalization
  • A process which abstracts a large set of
    task-relevant data in a database from low
    conceptual levels to higher ones.
  • Approaches
  • Data cube approach (OLAP approach)
  • Attribute-oriented induction approach

(Figure: conceptual levels 1–5, from low to high)
6
Characterization: Data Cube Approach (without
using Attribute-Oriented Induction)
  • Perform computations and store results in data
    cubes
  • Strengths
  • An efficient implementation of data
    generalization
  • Computation of various kinds of measures
  • e.g., count( ), sum( ), average( ), max( )
  • Generalization and specialization can be
    performed on a data cube by roll-up and
    drill-down (a pandas sketch of roll-up follows
    this slide)
  • Limitations
  • handles only dimensions of simple nonnumeric data
    and measures of simple aggregated numeric values
  • lacks intelligent analysis; can't tell which
    dimensions should be used or what level the
    generalization should reach

7
Attribute-Oriented Induction
  • Proposed in 1989 (KDD '89 workshop)
  • Not confined to categorical data or particular
    measures.
  • How is it done?
  • Collect the task-relevant data (initial relation)
    using a relational database query
  • Perform generalization by attribute removal or
    attribute generalization.
  • Apply aggregation by merging identical,
    generalized tuples and accumulating their
    respective counts.
  • Interactive presentation with users.

8
Basic Principles of Attribute-Oriented Induction
  • Data focusing: task-relevant data, including
    dimensions, and the result is the initial
    relation.
  • Attribute-removal: remove attribute A if there is
    a large set of distinct values for A but (1)
    there is no generalization operator on A, or (2)
    A's higher-level concepts are expressed in terms
    of other attributes.
  • Attribute-generalization: if there is a large set
    of distinct values for A, and there exists a set
    of generalization operators on A, then select an
    operator and generalize A.
  • Attribute-threshold control: typically 2-8,
    specified or default.
  • Generalized relation threshold control: controls
    the final relation/rule size.

9
Basic Algorithm for Attribute-Oriented Induction
  • InitialRel: query processing of task-relevant
    data, deriving the initial relation.
  • PreGen: based on the analysis of the number of
    distinct values in each attribute, determine a
    generalization plan for each attribute: removal,
    or how high to generalize.
  • PrimeGen: based on the PreGen plan, perform
    generalization to the right level to derive a
    prime generalized relation, accumulating the
    counts.
  • Presentation: user interaction: (1) adjust levels
    by drilling, (2) pivoting, (3) mapping into
    rules, cross tabs, visualization presentations.
  • (A compact Python sketch of these steps follows
    this slide.)
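
A compact sketch of the attribute-oriented induction loop described above. The data layout (tuples as dicts), the one-level-up hierarchy maps, and the threshold value are illustrative assumptions, not the textbook's exact data structures.

    from collections import Counter

    def step_up(value, hierarchy):
        """Map a value one level up its concept hierarchy (identity if absent)."""
        return hierarchy.get(value, value)

    def aoi(tuples, hierarchies, attr_threshold=4):
        """tuples: list of dicts; hierarchies: {attr: {value: higher_concept}}."""
        attrs = list(tuples[0].keys())
        rel = [dict(t) for t in tuples]

        for a in attrs:
            # Generalize attribute a while it has too many distinct values.
            while len({t[a] for t in rel}) > attr_threshold:
                if a not in hierarchies:              # no generalization operator
                    for t in rel:                      # -> attribute removal
                        del t[a]
                    break
                new_vals = [step_up(t[a], hierarchies[a]) for t in rel]
                if new_vals == [t[a] for t in rel]:    # hierarchy exhausted
                    break
                for t, v in zip(rel, new_vals):
                    t[a] = v

        # Merge identical generalized tuples, accumulating their counts.
        counts = Counter(tuple(sorted(t.items())) for t in rel)
        return [dict(k) | {"count": c} for k, c in counts.items()]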

10
Example
  • DMQL: describe general characteristics of
    graduate students in the Big-University database
  • use Big_University_DB
  • mine characteristics as Science_Students
  • in relevance to name, gender, major, birth_place,
    birth_date, residence, phone, gpa
  • from student
  • where status in 'graduate'
  • Corresponding SQL statement
  • Select name, gender, major, birth_place,
    birth_date, residence, phone, gpa
  • from student
  • where status in ('Msc', 'MBA', 'PhD')

11
Class Characterization: An Example
Initial Relation
Prime Generalized Relation
12
Presentation of Generalized Results
  • Generalized relation
  • Relations where some or all attributes are
    generalized, with counts or other aggregation
    values accumulated.
  • Cross tabulation
  • Mapping results into cross tabulation form
    (similar to contingency tables).
  • Visualization techniques
  • Pie charts, bar charts, curves, cubes, and other
    visual forms.
  • Quantitative characteristic rules
  • Mapping the generalized result into characteristic
    rules with quantitative information (e.g.,
    t-weights) associated with them

13
Presentation of Generalized Results(continued)
  • t-weight
  • An interestingness measure that describes the
    typicality of
  • each disjunct in the rule
  • each tuple in the corresponding generalized
    relation
  • n: number of tuples for the target class in the
    generalized relation
  • q1, ..., qn: tuples for the target class in the
    generalized relation
  • qa is in {q1, ..., qn}
  • t_weight = count(qa) / (count(q1) + ... + count(qn))
    (a small computation sketch follows this slide)
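
A minimal sketch of the t-weight computation defined above; the per-tuple counts are illustrative.

    def t_weights(counts):
        """counts: count(qi) for each generalized tuple of the target class."""
        total = sum(counts)
        return [c / total for c in counts]

    # e.g. three generalized tuples with counts 16, 22, 12 -> typicality shares
    print(t_weights([16, 22, 12]))   # [0.32, 0.44, 0.24]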

14
Presentation: Generalized Relation
15
Presentation: Crosstab
16
Implementation by Cube Technology
  • Construct a data cube on-the-fly for the given
    data mining query
  • Facilitate efficient drill-down analysis
  • May increase the response time
  • A balanced solution: precomputation of a subprime
    relation
  • Use a predefined precomputed data cube
  • Construct a data cube beforehand
  • Facilitate not only the attribute-oriented
    induction, but also attribute relevance analysis,
    dicing, slicing, roll-up and drill-down
  • Cost of cube computation and the nontrivial
    storage overhead

17
Characterization vs. OLAP
  • Similarity
  • Presentation of data summarization at multiple
    levels of abstraction.
  • Interactive drilling, pivoting, slicing and
    dicing.
  • Differences
  • Automated desired level allocation.
  • Dimension relevance analysis and ranking when
    there are many relevant dimensions.
  • Sophisticated typing on dimensions and measures.
  • Analytical characterization: data dispersion
    analysis.

18
Attribute Relevance Analysis
  • Why?
  • Which dimensions should be included?
  • How high a level of generalization?
  • Automatic vs. interactive
  • Reducing the number of attributes yields
    easy-to-understand patterns
  • What?
  • statistical method for preprocessing data
  • filter out irrelevant or weakly relevant
    attributes
  • retain or rank the relevant attributes
  • relevance related to dimensions and levels
  • analytical characterization, analytical
    comparison

19
Attribute relevance analysis (cont'd)
  • How?
  • Data Collection
  • Analytical Generalization
  • Use information gain analysis (e.g., entropy or
    other measures) to identify highly relevant
    dimensions and levels.
  • Relevance Analysis
  • Sort and select the most relevant dimensions and
    levels.
  • Attribute-oriented Induction for class
    description
  • On selected dimension/level
  • OLAP operations (e.g. drilling, slicing) on
    relevance rules

20
Relevance Measures
  • Quantitative relevance measure: determines the
    classifying power of an attribute within a set of
    data.
  • Methods
  • information gain (ID3)
  • gain ratio (C4.5)
  • gini index
  • χ² contingency table statistics
  • uncertainty coefficient

21
Information-Theoretic Approach
  • Decision tree
  • each internal node tests an attribute
  • each branch corresponds to attribute value
  • each leaf node assigns a classification
  • ID3 algorithm
  • build decision tree based on training objects
    with known class labels to classify testing
    objects
  • rank attributes with information gain measure
  • minimal height
  • the least number of tests to classify an object

22
Top-Down Induction of Decision Tree
Attributes: Outlook, Temperature, Humidity, Wind
PlayTennis: {yes, no}
23
Entropy and Information Gain
  • S contains si tuples of class Ci for i = 1, ..., m
  • Information required to classify any arbitrary
    tuple: I(s1, ..., sm) = - Σi (si/s) log2(si/s)
  • Entropy (weighted average) of attribute A with
    values {a1, a2, ..., av}:
    E(A) = Σj ((s1j + ... + smj)/s) · I(s1j, ..., smj)
  • Information gained by branching on attribute A:
    Gain(A) = I(s1, ..., sm) - E(A)
  • The higher the information gain, the more
    discriminating the attribute (a small sketch
    follows this slide)
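
A minimal sketch of the ID3 quantities above for a two-class setting; the partition of counts per attribute value is illustrative, not the slide's data.

    from math import log2

    def info(counts):
        """Expected information I(s1, ..., sm) for the given class counts."""
        s = sum(counts)
        return -sum((si / s) * log2(si / s) for si in counts if si)

    def gain(partition):
        """partition: {attribute_value: [s1j, ..., smj]} -> information gain."""
        totals = [sum(col) for col in zip(*partition.values())]
        s = sum(totals)
        e_a = sum(sum(cj) / s * info(cj) for cj in partition.values())
        return info(totals) - e_a

    # An attribute with three values splitting 120 target / 130 contrasting tuples
    print(gain({"v1": [60, 20], "v2": [40, 50], "v3": [20, 60]}))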

24
Example: Analytical Characterization
  • Task
  • Mine general characteristics describing graduate
    students using analytical characterization
  • Given
  • attributes name, gender, major, birth_place,
    birth_date, phone, and gpa
  • Gen(ai): concept hierarchies on ai
  • Ui: attribute analytical thresholds for ai
  • Ti: attribute generalization thresholds for ai
  • R: attribute relevance threshold

25
Example: Analytical Characterization (cont'd)
  • 1. Data collection
  • target class: graduate student
  • contrasting class: undergraduate student
  • 2. Analytical generalization using Ui
  • attribute removal
  • remove name and phone
  • attribute generalization
  • generalize major, birth_place, birth_date and
    gpa
  • accumulate counts
  • candidate relation: gender, major, birth_country,
    age_range and gpa

26
Example: Analytical Characterization (2)
Candidate relation for the target class, graduate
students (Σ count = 120)
Candidate relation for the contrasting class,
undergraduate students (Σ count = 130)
27
Example: Analytical Characterization (3)
  • 3. Relevance analysis
  • Calculate the expected information required to
    classify an arbitrary tuple
  • Calculate the entropy of each attribute, e.g.,
    major

28
Example: Analytical Characterization (4)
  • Calculate expected info required to classify a
    given sample if S is partitioned according to the
    attribute
  • Calculate information gain for each attribute
  • Information gain for all attributes

(Computed values shown on the slide, without their attribute labels: 0, 0.9892, 0.9183, 0.7873, 0.9988)
29
Example: Analytical Characterization (5)
  • 4. Initial working relation (W0) derivation
  • R (attribute relevance threshold) = 0.1
  • remove irrelevant/weakly relevant attributes from
    the candidate relation => drop gender, birth_country
  • remove the contrasting-class candidate relation
  • 5. Perform attribute-oriented induction on W0
    using Ti

Initial target class working relation W0:
graduate students
30
Mining Class Comparisons
  • Comparison: comparing two or more classes.
  • Method
  • Partition the set of relevant data into the
    target class and the contrasting class(es)
  • Generalize both classes to the same high-level
    concepts
  • Compare tuples with the same high-level
    descriptions
  • Present, for every tuple, its description and two
    measures (a sketch follows this slide)
  • support: distribution within a single class
  • comparison: distribution between classes
  • Highlight the tuples with strong discriminant
    features
  • Relevance analysis
  • Find the attributes (features) which best
    distinguish the different classes.
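
A minimal sketch of the two measures for generalized descriptions shared by both classes; the count tables are illustrative assumptions.

    # Illustrative counts per generalized description in each class
    target      = {("Canada", "25-30", "good"): 90,  ("foreign", "31-35", "very_good"): 30}
    contrasting = {("Canada", "25-30", "good"): 210, ("foreign", "31-35", "very_good"): 20}

    t_total = sum(target.values())
    for desc, cnt in target.items():
        support    = cnt / t_total                           # distribution within the target class
        comparison = cnt / (cnt + contrasting.get(desc, 0))  # distribution between the classes
        print(desc, f"support={support:.0%}", f"comparison={comparison:.0%}")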

31
Example: Analytical Comparison
  • Task
  • Compare graduate and undergraduate students using
    a discriminant rule.
  • DMQL query

use Big_University_DB
mine comparison as grad_vs_undergrad_students
in relevance to name, gender, major, birth_place,
  birth_date, residence, phone, gpa
for graduate_students where status in 'graduate'
versus undergraduate_students where status in 'undergraduate'
analyze count
from student
32
Example: Analytical Comparison (2)
  • Given
  • attributes name, gender, major, birth_place,
    birth_date, residence, phone and gpa
  • Gen(ai): concept hierarchies on attributes ai
  • Ui: attribute analytical thresholds for
    attributes ai
  • Ti: attribute generalization thresholds for
    attributes ai
  • R: attribute relevance threshold

33
Example: Analytical Comparison (3)
  • 1. Data collection
  • target and contrasting classes
  • 2. Attribute relevance analysis
  • remove attributes: name, gender, major, phone
  • 3. Synchronous generalization
  • controlled by user-specified dimension thresholds
  • prime target and contrasting class(es)
    relations/cuboids

34
Example: Analytical Comparison (4)
Prime generalized relation for the target class:
graduate students
Prime generalized relation for the contrasting
class: undergraduate students
35
Example: Analytical Comparison (5)
  • 4. Drill down, roll up and other OLAP operations
    on target and contrasting classes to adjust
    levels of abstractions of resulting description
  • 5. Presentation
  • as generalized relations, crosstabs, bar charts,
    pie charts, or rules
  • contrasting measures to reflect comparison
    between target and contrasting classes
  • e.g., count

36
Quantitative Discriminant Rules
  • Cj = target class
  • qa: a generalized tuple that covers some tuples of
    the target class
  • but can also cover some tuples of the contrasting
    class(es)
  • d-weight = count(qa ∈ Cj) / Σi count(qa ∈ Ci),
    summed over all m classes Ci
  • range: [0.0, 1.0] (or [0%, 100%])

37
Quantitative Discriminant Rules
  • High d-weight in target class indicates that
    concept represented by generalized tuple is
    primarily derived from target class
  • Low d-weight implies concept is derived from
    contrasting class
  • Threshold can be set to control the display of
    interesting tuples
  • quantitative discriminant rule form
  • Read as: if X satisfies the condition, there is a
    probability (the d-weight) that X is in the target
    class

38
Example: Quantitative Discriminant Rule
Count distribution between graduate and
undergraduate students for a generalized tuple
  • Quantitative discriminant rule
  • where d-weight = 90/(90 + 210) = 30%

39
Example: Quantitative Description Rule
  • Quantitative description rule for the target class
    Europe

Crosstab showing associated t-weight, d-weight
values and total number (in thousands) of TVs and
computers sold at AllElectronics in 1998
40
Mining Data Dispersion Characteristics
  • Motivation
  • To better understand the data: central tendency,
    variation and spread
  • Data dispersion characteristics
  • median, max, min, quantiles, outliers, variance,
    etc.
  • Numerical dimensions correspond to sorted
    intervals
  • Data dispersion analyzed with multiple
    granularities of precision
  • Boxplot or quantile analysis on sorted intervals
  • Dispersion analysis on computed measures
  • Folding measures into numerical dimensions
  • Boxplot or quantile analysis on the transformed
    cube

41
Measuring the Central Tendency
  • Mean
  • Weighted arithmetic mean
  • Median: a holistic measure
  • Middle value if odd number of values, or average
    of the middle two values otherwise
  • estimated by interpolation for grouped data
  • Mode
  • Value that occurs most frequently in the data
  • Unimodal, bimodal, trimodal
  • Empirical formula relating mean, median and mode
    (the standard forms of these formulas are
    reproduced after this slide)
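
The formulas referenced above appeared as images on the original slide; the standard textbook forms, reproduced here for completeness, are:

    \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
    \qquad \text{(weighted arithmetic mean)}

    \mathrm{median} \approx L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{\mathrm{median}}} \right) c
    \qquad \text{(median of grouped data by interpolation)}

    \mathrm{mean} - \mathrm{mode} \approx 3 \times (\mathrm{mean} - \mathrm{median})
    \qquad \text{(empirical relation for unimodal, moderately skewed data)}

where L1 is the lower boundary of the median class, (Σ f)_l is the sum of the frequencies of the classes below it, f_median is the frequency of the median class, and c is the class width.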

42
Measuring the Dispersion of Data
  • Quartiles, outliers and boxplots
  • Quartiles: Q1 (25th percentile), Q3 (75th
    percentile)
  • Inter-quartile range: IQR = Q3 - Q1
  • Five-number summary: min, Q1, M (median), Q3, max
  • Boxplot: ends of the box are the quartiles, the
    median is marked, whiskers extend outward, and
    outliers are plotted individually
  • Outlier: usually, a value more than 1.5 x IQR
    above Q3 or below Q1
  • Variance and standard deviation
  • Variance: s2 (algebraic, scalable computation)
  • Standard deviation: s, the square root of the
    variance s2
    (a NumPy sketch of these measures follows this
    slide)
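
A minimal NumPy sketch of the dispersion measures above; the data values are illustrative.

    import numpy as np

    x = np.array([4, 8, 15, 16, 23, 42, 7, 11, 95], dtype=float)

    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    five_number = (x.min(), q1, med, q3, x.max())

    # Common outlier rule: beyond 1.5 * IQR outside the quartiles
    outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

    variance = x.var(ddof=1)          # sample variance s^2
    std_dev  = x.std(ddof=1)          # sample standard deviation s
    print(five_number, iqr, outliers, variance, std_dev)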

43
Boxplot Analysis
  • Five-number summary of a distribution
  • Minimum, Q1, M, Q3, Maximum
  • Boxplot
  • Data is represented with a box
  • The ends of the box are at the first and third
    quartiles, i.e., the height of the box is IQR
    (interquartile range)
  • The median is marked by a line within the box
  • Whiskers: two lines outside the box extending to
    the minimum and maximum

44
A Boxplot
45
DBMiner
  • 4 examples of Boxplot Analysis

46
Mining Descriptive Statistical Measures in Large
Databases
  • Variance
  • Standard deviation: the square root of the
    variance
  • Measures spread about the mean
  • It is zero if and only if all the values are
    equal
  • Both the standard deviation and the variance are
    algebraic measures

47
Histogram Analysis
  • Graph displays of basic statistical class
    descriptions
  • Frequency histograms
  • A univariate graphical method
  • Consists of a set of rectangles that reflect the
    counts or frequencies of the classes present in
    the given data

48
DBMiner
  • 2 examples of Histogram Analysis

49
Quantile Plot
  • Displays all of the data (allowing the user to
    assess both the overall behavior and unusual
    occurrences)
  • Plots quantile information
  • For data xi sorted in increasing order, fi
    indicates that approximately 100·fi% of the data
    are below or equal to the value xi

50
Quantile-Quantile (Q-Q) Plot
  • Graphs the quantiles of one univariate
    distribution against the corresponding quantiles
    of another
  • Allows the user to view whether there is a shift
    in going from one distribution to another

51
Scatter plot
  • Provides a first look at bivariate data to see
    clusters of points, outliers, etc.
  • Each pair of values is treated as a pair of
    coordinates and plotted as points in the plane

52
Loess Curve
  • Adds a smooth curve to a scatter plot in order to
    provide better perception of the pattern of
    dependence
  • The loess curve is fitted by setting two
    parameters: a smoothing parameter, and the degree
    of the polynomials that are fitted by the
    regression
53
Graphic Displays of Basic Statistical Descriptions
  • Histogram (shown before)
  • Boxplot (covered before)
  • Quantile plot: each value xi is paired with fi,
    indicating that approximately 100·fi% of the data
    are ≤ xi
  • Quantile-quantile (q-q) plot: graphs the
    quantiles of one univariate distribution against
    the corresponding quantiles of another
  • Scatter plot: each pair of values is a pair of
    coordinates and plotted as points in the plane
  • Loess (local regression) curve: adds a smooth
    curve to a scatter plot to provide better
    perception of the pattern of dependence
    (a combined plotting sketch follows this slide)
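
A minimal sketch of these displays using matplotlib (and statsmodels for the loess/lowess curve); the randomly generated data is illustrative.

    import numpy as np
    import matplotlib.pyplot as plt
    from statsmodels.nonparametric.smoothers_lowess import lowess

    rng = np.random.default_rng(0)
    x = np.sort(rng.normal(50, 10, 200))          # one batch of values, sorted
    y = np.sort(rng.normal(55, 12, 200))          # a second batch to compare against

    fig, ax = plt.subplots(2, 3, figsize=(12, 7))

    ax[0, 0].hist(x, bins=20);  ax[0, 0].set_title("Histogram")
    ax[0, 1].boxplot(x);        ax[0, 1].set_title("Boxplot")

    f = (np.arange(len(x)) + 0.5) / len(x)        # f_i: ~100*f_i% of data <= x_i
    ax[0, 2].plot(f, x, ".");   ax[0, 2].set_title("Quantile plot")

    ax[1, 0].plot(x, y, ".")                      # q-q: quantiles of x vs. quantiles of y
    ax[1, 0].plot(x, x, "--")                     # reference line
    ax[1, 0].set_title("Q-Q plot")

    xs = rng.normal(50, 10, 200)                  # bivariate data for scatter + loess
    ys = 0.8 * xs + rng.normal(0, 5, 200)
    ax[1, 1].plot(xs, ys, ".")
    smooth = lowess(ys, xs, frac=0.3)             # columns: sorted x, smoothed y
    ax[1, 1].plot(smooth[:, 0], smooth[:, 1]);  ax[1, 1].set_title("Scatter + loess")

    ax[1, 2].axis("off")
    plt.tight_layout()
    plt.show()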

54
AO Induction vs. Learning-from-example Paradigm
  • Difference in philosophies and basic assumptions
  • Positive and negative samples in
    learning-from-example: positive samples are used
    for generalization, negative samples for
    specialization
  • Positive samples only in data mining: hence
    generalization-based; to drill down, backtrack the
    generalization to a previous state
  • Difference in methods of generalization
  • Machine learning generalizes on a tuple-by-tuple
    basis
  • Data mining generalizes on an
    attribute-by-attribute basis

55
Comparison of Entire vs. Factored Version Space
56
Incremental and Parallel Mining of Concept
Description
  • Incremental mining: revision based on newly added
    data ΔDB (delta DB)
  • Generalize ΔDB to the same level of abstraction
    as the generalized relation R to derive ΔR
  • Union R ∪ ΔR, i.e., merge counts and other
    statistical information to produce a new relation
    R' (a small sketch follows this slide)
  • A similar philosophy can be applied to data
    sampling, parallel and/or distributed mining, etc.
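
A minimal sketch of the incremental step: after ΔDB has been generalized to the level of R, the union is just a merge of per-tuple counts. The {generalized_tuple: count} maps are illustrative.

    from collections import Counter

    R       = Counter({("Canada", "25-30", "good"): 90, ("foreign", "31-35", "very_good"): 30})
    delta_R = Counter({("Canada", "25-30", "good"): 12, ("Canada", "21-25", "excellent"): 5})

    R_new = R + delta_R        # union with merged counts
    print(R_new)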

57
Summary
  • Concept description: characterization and
    discrimination
  • OLAP-based vs. attribute-oriented induction
  • Efficient implementation of AOI
  • Analytical characterization and comparison
  • Mining descriptive statistical measures in large
    databases
  • Discussion
  • Incremental and parallel mining of description
  • Descriptive mining of complex types of data

58
Homework Assignment
  • This homework assignment will utilize the data
    warehouse you previously built incorporating the
    hurricane data
  • Implement an automated Attribute-Oriented
    Induction capability on your hurricane data
  • Input needed
  • Your relation tables for the hurricane data
  • A DMQuery for characterization
  • A list of attributes
  • A set of concept hierarchies or generalization
    operators
  • A generalized relation threshold
  • Attribute generalization thresholds for each
    attribute

59
Homework Assignment
  • Transform the DMQL statement to a relational
    query (can do this by hand)
  • Use this relational query in your program to
    retrieve data and then perform the AOI (see the
    algorithm on p. 188 of the book)
  • Visualize the results of the AOI via a
    generalized relation
  • Due April 22