1
Lecture 7
Discussion (KDD 2 of 3): Feature Selection for KDD
Tuesday, December 7, 1999
William H. Hsu
Department of Computing and Information Sciences, KSU
http://www.cis.ksu.edu/~bhsu
Readings: Paper 2; Liu and Motoda, Chapter 3
2
Lecture Outline
  • Readings: Liu and Motoda
    • Feature Selection for Knowledge Discovery and Data Mining
    • Chapter 3: feature selection aspects
  • What is Feature Selection?
  • Generation Scheme
    • How do we generate subsets?
    • Forward, backward, bidirectional, random, opportunistic
  • Evaluation Measure
    • How do we tell how good a candidate subset is?
    • Accuracy, consistency, scores (information gain, cross entropy, variance, etc.)
  • Search Strategy
    • How do we systematically search for a good subset?
    • Blind (uninformed) search
    • Heuristic (informed) search
  • Next Class: Presentation

3
What is Feature Selection?
  • Problem: choosing inputs x for supervised learning
  • Applications
    • Concept learning for monitoring
    • Extraction of temporal features
    • Sensor and data fusion
  • Solutions
    • Decomposition of spatiotemporal data
    • Attribute-driven problem redefinition
    • Constructive induction
    • Model selection
  • Approach
    • Hierarchy of temporal submodels
    • Probabilistic subnetworks: ANNs, Bayesian networks
    • Quantitative (metric-based) model selection

4
Lecture Outline
  • Readings: Liu and Motoda
    • Feature Selection for Knowledge Discovery and Data Mining
    • Chapter 3: feature selection aspects
  • What is Feature Selection?
  • Generation Scheme
    • How do we generate subsets?
    • Forward, backward, bidirectional, random, opportunistic
  • Evaluation Measure
    • How do we tell how good a candidate subset is?
    • Accuracy, consistency, scores (information gain, cross entropy, variance, etc.)
  • Search Strategy
    • How do we systematically search for a good subset?
    • Blind (uninformed) search
    • Heuristic (informed) search
  • Next Class: Presentation

5
Issues: Generation Scheme and Evaluation Measure
  • Generation Scheme (a minimal code sketch follows this list)
    • Directed subset construction
    • Forward: start with Ø and grow until U(S) is high enough
    • Backward: start with S and shrink while U(S) is still high enough
    • Bidirectional: meet in the middle (S, F boundaries)
    • Random: iterative improvement (cf. simulated annealing) using F
    • Opportunistic: prior knowledge guides generation (compare heuristic search)
  • Evaluation Measure
    • r (MAXR): ρ(xi, y) = Cov(xi, y) / sqrt(Var(xi) · Var(y))
    • Accuracy
    • Consistency
    • Classical scores
      • Information gain
      • Cross entropy
      • Variance
    • Many others (Gini coefficient, dependence)
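Below is a minimal, illustrative sketch of the forward generation scheme paired with a simple correlation-style utility U(S). The function names, the averaging used for U(S), and the stopping threshold are assumptions made for illustration, not prescriptions from Liu and Motoda.

```python
import numpy as np

def correlation_score(x, y):
    """Per-feature evaluation measure: |Cov(x, y)| / sqrt(Var(x) * Var(y))."""
    cov = np.cov(x, y, bias=True)[0, 1]
    return abs(cov) / np.sqrt(np.var(x) * np.var(y))

def subset_utility(X, y, subset):
    """U(S): mean per-feature correlation with the target (a simple filter score)."""
    return float(np.mean([correlation_score(X[:, f], y) for f in subset]))

def forward_selection(X, y, utility=subset_utility, threshold=0.9):
    """Forward generation: start with the empty set and greedily add the
    feature that most improves U(S), stopping once U(S) is high enough."""
    n_features = X.shape[1]
    selected, best_u = [], 0.0
    while len(selected) < n_features and best_u < threshold:
        candidates = [f for f in range(n_features) if f not in selected]
        scores = {f: utility(X, y, selected + [f]) for f in candidates}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_u:   # no improvement: stop early
            break
        selected.append(f_best)
        best_u = scores[f_best]
    return selected, best_u
```

Backward elimination is the mirror image: start from the full set and repeatedly delete the feature whose removal hurts U(S) least, while U(S) stays high enough.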

6
Search
Subset Inclusion State Space
Poset relation: set inclusion, A ⊇ B (B is a subset of A)
Up operator: DELETE
Down operator: ADD
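As a small illustration (the function names are mine, not the slide's), each search state is a feature subset, and the two operators generate neighbours one level up or down the subset-inclusion lattice:

```python
def delete_successors(state):
    """Up operator DELETE: all states reachable by removing one feature."""
    return [frozenset(state - {f}) for f in state]

def add_successors(state, all_features):
    """Down operator ADD: all states reachable by adding one unused feature."""
    return [frozenset(state | {f}) for f in all_features - state]

# Example: neighbours of the subset {x1, x3} over four features
features = frozenset({"x1", "x2", "x3", "x4"})
state = frozenset({"x1", "x3"})
print(delete_successors(state))         # subsets one level up (smaller)
print(add_successors(state, features))  # subsets one level down (larger)
```

Forward selection searches downward using ADD only, backward elimination searches upward using DELETE only, and bidirectional search interleaves the two.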
7
Feature Selection and Construction as Unsupervised Learning
  • Unsupervised Learning in Support of Supervised Learning
    • Given: D ≡ labeled vectors (x, y)
    • Return: D′ ≡ new training examples (x′, y)
    • Constructive induction: transformation step in KDD
      • Feature construction: generic term
      • Cluster definition
  • Feature Construction (Front End)
    • Synthesizing new attributes (a small synthesis sketch appears below)
      • Logical: x1 ∧ ¬x2; arithmetic: x1 · x5 / x2
      • Other synthetic attributes: f(x1, x2, …, xn), etc.
    • Dimensionality-reducing projection, feature extraction
    • Subset selection: finding relevant attributes for a given target y
    • Partitioning: finding relevant attributes for given targets y1, y2, …, yp
  • Cluster Definition (Back End)
    • Form, segment, and label clusters to get intermediate targets y′
    • Change of representation: find good (x′, y′) for learning target y

x′ ≡ (x1′, …, xp′)
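A minimal sketch of the feature-construction front end; the particular logical and arithmetic combinations (and the threshold at zero) are illustrative placeholders, not the constructions used in the lecture:

```python
import numpy as np

def construct_features(X):
    """Synthesize two new attributes from an (n, 5) array of originals x1..x5
    and append them on the right, yielding the transformed vector x'."""
    x1, x2, x5 = X[:, 0], X[:, 1], X[:, 4]
    logical = ((x1 > 0) & ~(x2 > 0)).astype(float)   # e.g. x1 AND NOT x2, thresholded at 0
    arithmetic = x1 * x5 / (x2 + 1e-9)               # e.g. x1 * x5 / x2, guarded against x2 = 0
    return np.column_stack([X, logical, arithmetic])
```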
8
Wrappers for Performance Enhancement
  • Wrappers
    • Outer loops for improving inducers
    • Use inducer performance to optimize
  • Applications of Wrappers
    • Combining knowledge sources
      • Committee machines (static): bagging, stacking, boosting
      • Other: sensor and data fusion
    • Tuning hyperparameters
      • Number of ANN hidden units
      • GA control parameters
      • Priors in Bayesian learning
    • Constructive induction
      • Attribute (feature) subset selection
      • Feature construction
  • Implementing Wrappers (a cross-validation sketch follows this list)
    • Search [Kohavi, 1995]
    • Genetic algorithm
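A minimal sketch of the wrapper loop, using scikit-learn's decision tree and cross-validation only as stand-ins for the inducer and its performance estimate; this illustrates the wrapper idea, not Kohavi's MLC++ FSS implementation:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def wrapper_utility(X, y, subset, inducer=None, cv=5):
    """Wrapper evaluation: U(S) = cross-validated accuracy of the inducer
    trained only on the candidate subset S."""
    inducer = inducer if inducer is not None else DecisionTreeClassifier(random_state=0)
    return cross_val_score(inducer, X[:, subset], y, cv=cv).mean()

def greedy_wrapper_selection(X, y):
    """Forward search in the outer loop: add whichever feature most improves
    the inducer's estimated accuracy; stop when nothing helps."""
    n = X.shape[1]
    selected, best = [], -np.inf
    while len(selected) < n:
        scores = {f: wrapper_utility(X, y, selected + [f])
                  for f in range(n) if f not in selected}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:
            break
        selected.append(f_best)
        best = scores[f_best]
    return selected, best
```

A genetic-algorithm wrapper would replace this greedy forward search with a population of feature bit masks scored by the same cross-validated utility.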

9
Supervised Learning Framework
10
Case Study: Automobile Insurance Risk Analysis
11
Terminology
  • Supervised Learning
    • Inducer
    • Supervised inductive learning framework (L, H)
      • L: learning algorithm; H: hypothesis space (language)
    • Relevance determination: finding inputs that are important to the performance element (e.g., regression or classification)
  • Feature Selection
    • Related terms: feature, attribute, variable
    • Definition: the problem of determining, for a given inducer, which features (attributes) to use as inputs
    • Related problems: feature extraction, construction (synthesis), partitioning
  • Methods for Feature Selection
    • Feature ranking (a small filter sketch follows this list)
    • Subset selection: minimum subset (Min-Set)
    • Set generation (regression): sequential forward (forward selection), sequential backward (backward elimination), bidirectional, random
    • Search strategies: uninformed, informed
    • Filters vs. wrappers
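A minimal sketch of feature ranking as a filter, assuming discrete feature values; the helper names are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_c p(c) log2 p(c), estimated from label frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - sum_v P(X = v) * H(Y | X = v), for a discrete feature X."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [lab for fv, lab in zip(feature_values, labels) if fv == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def rank_features(columns, labels):
    """Feature ranking: return (column index, gain) pairs sorted best-first."""
    return sorted(((i, information_gain(col, labels)) for i, col in enumerate(columns)),
                  key=lambda pair: pair[1], reverse=True)
```

Min-Set subset selection, by contrast, searches for the smallest subset whose joint score meets a target rather than scoring features independently.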

12
Summary Points
  • Feature Selection and Knowledge Discovery in Databases (KDD)
    • Virtuous cycle of data mining: iterative refinement
    • Feedback from supervised learning
  • Role of Feature Selection in Data Mining
    • Relevance determination
  • Methodologies
    • Filters vs. wrappers
    • Generation scheme, evaluation measure, search strategy
  • Resources Online
    • MLC++
      • FSS wrapper
      • Many inducers, including ID3, OC1
      • http://www.sgi.com/Technology/mlc
    • Jenesis
      • Part of NCSA D2K: http://lorax.ncsa.uiuc.edu
      • KSU KDD Group: http://www.kddresearch.org/Info
    • C4.5 / C5.0: http://www.rulequest.com