1
A Genetic Programming System for Building Block
Analysis to Enhance Data Analysis and Data Mining
Techniques
  • Christoph F. Eick, Walter D. Sanz, and Ruijian
    Zhang
  • www.cs.uh.edu/ceick/matta.html
  • University of Houston
  • Organization
  • 1. Introduction
  • 2. General Approach and Corresponding
    Methodology
  • 3. Wols --- A Genetic Programming System to
    Find Building Blocks
  • 4. Experimental Results
  • 5. Summary and Conclusion

2
1. Introduction
  • Regression analysis, which is also known as curve fitting, may be broadly defined as the analysis of relationships among variables. The actual specification of the form of the function, whether it be linear, quadratic, or polynomial, is left entirely to the user.
  • Symbolic regression is the process of discovering both the functional form of a target function and all of its necessary coefficients. Symbolic regression involves finding a mathematical expression, in symbolic form, that provides a good, best, or perfect fit between a finite sampling of values of the input variables and the associated values of the dependent variables.
  • Genetic Programming simulates the Darwinian
    evolutionary process and naturally occurring
    genetic operations on computer programs. Genetic
    programming provides a way to search for the
    fittest computer program in a search space which
    consists of all possible computer programs
    composed of functions and terminals appropriate
    to the problem domain.

3
Example -- Evolving Syntax Trees with Genetic
Programming
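A minimal sketch of the idea (illustrative Common Lisp, not the WOLS source): a syntax tree is naturally written as a nested list such as (+ (* A B) (COS A)), and evaluating a candidate tree on one test case is a short recursive walk.

  (defun eval-tree (tree bindings)
    "Evaluate syntax TREE using BINDINGS, an alist such as ((A . 1.0) (B . 2.0))."
    (cond ((numberp tree) tree)                          ; constant leaf
          ((symbolp tree) (cdr (assoc tree bindings)))   ; variable leaf
          (t (apply (symbol-function (first tree))       ; interior node: apply the operator
                    (mapcar (lambda (arg) (eval-tree arg bindings))
                            (rest tree))))))

  ;; Example: (eval-tree '(+ (* a b) (cos a)) '((a . 0.0) (b . 3.0)))  =>  1.0

Crossover and mutation then manipulate these nested lists directly, which is why genetic programming systems such as WOLS are conveniently written in Lisp.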
4
Motivating Example
Task: Approximate boundaries between classes with rectangles.
[Figure: two arrangements of classes C1 and C2 -- in the first, the boundary is easy to approximate with rectangles; in the second, difficult]
Idea: Transform the representation space so that the approximation problem becomes simpler (e.g., instead of (x, y), just use (x-y)). Problem: How do we find such representation space transformations?
5
Building Blocks
  • Definition: Building blocks are patterns which occur with a high frequency in a set of solutions.
  • Example:
  • (- (+ C D) (+ A B))
  • (- C (+ B A))
  • (- (+ A B) (COS A))
  • (- (+ C (/ E (SQRT D))) (+ A B))
  • Leaf Building Block: (+ A B)
  • Root Originating Building Block: (- ? (+ A B))
  • FUNC1 = (+ A B)

6
Constructive Induction
  • Definition: Constructive Induction is a general approach for coping with inadequate, redundant, and useless attributes found in the original data. Constructive Induction explores various representation spaces. Building block analysis and constructive induction are used in this research to transform the original data by adding new attributes and by dropping useless ones.
  • Example
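A minimal sketch of the transformation step, assuming each tuple in the data collection is a plain list of attribute values (ADD-ATTRIBUTE is a hypothetical helper, not part of WOLS):

  (defun add-attribute (tuples new-attr-fn)
    "Append the value of NEW-ATTR-FN, applied to each tuple's attributes, as a new attribute."
    (mapcar (lambda (tuple) (append tuple (list (apply new-attr-fn tuple))))
            tuples))

  ;; Example, in the spirit of the (x,y) -> (x-y) transformation above:
  ;; (add-attribute '((5 2) (1 4)) (lambda (x y) (- x y)))  =>  ((5 2 3) (1 4 -3))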

7
2. General Approach and Corresponding Methodology
  • Primary Learning System: the one we use for the task at hand (e.g., C5.0)
  • Secondary Learning System: used to find building blocks
  • General Steps of the Methodology:
  • 0. Run the primary learning system for the data
    collection. Record prediction performance and
    good solutions.
  • 1. Run the genetic programming/symbolic
    regression/building block analysis system
    (secondary learning system) several times for the
    data collection with a rich function set. Record
    prediction performance and good solutions.
  • 2. Analyze the good solutions found by the
    secondary learning system for the occurrence of
    building blocks.
  • 3. Using the results of step 2, transform the
    data collection by generating new attributes.
  • 4. Redo steps 0-3 for the modified data collection as long as a significant improvement occurs (this loop is sketched below).
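The loop of steps 0-4 might be sketched as follows. RUN-PRIMARY, RUN-SECONDARY, FIND-BUILDING-BLOCKS, and TRANSFORM-DATA are hypothetical names standing for the primary learner (e.g., C5.0), the GP system, the building block analysis, and the attribute transformation; higher scores are assumed to mean better prediction performance.

  (defun improve-representation (data &key (min-gain 0.01))
    (loop with best-score = (run-primary data)              ; step 0
          for solutions = (run-secondary data)              ; step 1
          for blocks = (find-building-blocks solutions)     ; step 2
          for new-data = (transform-data data blocks)       ; step 3
          for new-score = (run-primary new-data)            ; step 0 on the transformed data
          while (> new-score (+ best-score min-gain))       ; step 4: repeat only on significant improvement
          do (setf data new-data
                   best-score new-score)
          finally (return (values data best-score))))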

8
3. Wols --- A Genetic Programming System that
Finds Building Blocks
9
Wols Major Inputs
10
Features of the Wols System
  • The WOLS system is basically a modified version of the standard genetic programming model. A standard genetic programming model employs crossover and mutation operators that evolve solutions (here, approximation functions), relying on the principle of survival of the fittest: better solutions reproduce with a higher probability.
  • The WOLS system employs the traditional sub-tree swapping crossover operator (sketched after this list), but provides a quite sophisticated set of 7 mutation operators that apply various random changes to solutions.
  • The goal of the WOLS system is to find good approximations for an unknown function f, and to analyze the decomposition of good approximations with the goal of identifying building blocks.
  • The front end of the WOLS system takes a data collection DC(A1,...,An,B) as its input and tries to find good approximations h, where B ≈ h(A1,...,An), of the unknown function f with respect to a given function set and a given error function.
  • The symbolic regression genetic programming component (SRGP) of the WOLS system includes an interactive front end which allows a user to easily generate a genetic program that employs symbolic regression and is tailor-made to the problem he wishes to solve.
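A minimal sketch of the traditional sub-tree swapping crossover mentioned above (illustrative, not the WOLS code): pick a random node in each parent and graft the second parent's subtree into the first parent's position.

  (defun count-nodes (tree)
    "Number of nodes in TREE; an atom counts as one node."
    (if (atom tree)
        1
        (1+ (reduce #'+ (mapcar #'count-nodes (rest tree))))))

  (defun nth-subtree (tree n)
    "Return the Nth subtree of TREE in preorder (node 0 is TREE itself)."
    (if (zerop n)
        tree
        (let ((n (1- n)))
          (dolist (child (rest tree))
            (let ((size (count-nodes child)))
              (if (< n size)
                  (return (nth-subtree child n))
                  (decf n size)))))))

  (defun replace-nth-subtree (tree n new)
    "Return a copy of TREE with its Nth preorder subtree replaced by NEW."
    (if (zerop n)
        new
        (let ((n (1- n)))
          (cons (first tree)
                (mapcar (lambda (child)
                          (let ((size (count-nodes child)))
                            (prog1 (if (and (>= n 0) (< n size))
                                       (replace-nth-subtree child n new)
                                       child)
                              (decf n size))))
                        (rest tree))))))

  (defun crossover (parent-a parent-b)
    "One offspring: a random subtree of PARENT-B grafted into a random position of PARENT-A."
    (replace-nth-subtree parent-a
                         (random (count-nodes parent-a))
                         (nth-subtree parent-b (random (count-nodes parent-b)))))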

11
Features of the Wols System (cont.)
  • The system run starts by asking the user interactively for the parameters needed to run the symbolic regression system. The information the user provides is then used to generate the functions for the specific problem he wishes to solve. These problem-specific functions are then merged with a set of general genetic programming functions to form a complete problem-specific genetic program.
  • The SRGP component allows the user to select among three different error functions (sketched after this list):
  • Manhattan error function (minimizes the absolute
    error in the output variable)
  • Least Square error function (minimizes the
    squared error in the output variable)
  • Classification error function (minimizes the
    number of misclassifications)
  • The WOLS system is capable of solving both regression analysis and classification problems. In regression analysis problems the goal is to minimize the error, while in classification problems the goal is to minimize the number of misclassifications.
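Minimal sketches of the three error measures, assuming XS holds a candidate solution's outputs and YS the desired outputs (illustrative, not the WOLS source):

  (defun manhattan-error (xs ys)
    "Sum of absolute errors in the output variable."
    (reduce #'+ (mapcar (lambda (x y) (abs (- x y))) xs ys)))

  (defun least-square-error (xs ys)
    "Sum of squared errors in the output variable."
    (reduce #'+ (mapcar (lambda (x y) (expt (- x y) 2)) xs ys)))

  (defun classification-error (xs ys)
    "Number of misclassifications."
    (count nil (mapcar #'eql xs ys)))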

12
Features of the Wols System (cont.)
  • The goal of the decision block analysis tool is to identify frequently occurring patterns in sets of solutions. In our particular approach, building blocks are expressions that involve function symbols, variable symbols, as well as the ? (don't care) symbol.
  • For example, if DC(A1, A2, A3, B) is a data collection, the variable A1, the expression (+ A2 A3), and the expression (* (sin A1) (+ A2 ?)) are all potential building blocks (a matcher for such patterns is sketched after this list).
  • In general we are interested in finding patterns that occur in about 85% or more of the set of analyzed approximations.
  • The identification of building blocks is useful both for improving the attribute space and for determining which part of the search space has been explored.
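A minimal sketch of testing whether an expression contains a given building-block pattern with the ? (don't care) symbol (illustrative; WOLS's actual matcher is not shown in the slides):

  (defun pattern-match-p (pattern expr)
    "True if EXPR matches PATTERN, where the symbol ? matches any subtree."
    (cond ((eq pattern '?) t)
          ((atom pattern) (eql pattern expr))
          ((atom expr) nil)
          ((/= (length pattern) (length expr)) nil)
          (t (every #'pattern-match-p pattern expr))))

  ;; Example: (pattern-match-p '(- ? (+ a b)) '(- (cos c) (+ a b)))  =>  T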

13
Decision Block Analysis Functions
  • Root Analysis: Finds common root patterns among a population of solution trees.
  • General Analysis: Finds common patterns anywhere in a solution tree.
  • Leaf Analysis: Finds common patterns at the leaf nodes of solution trees.
  • Frequency Analysis: Finds the total number of times a specific operator, variable, or constant is used in the population of solution trees (sketched below).
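A sketch of frequency analysis (illustrative): walk every solution tree and tally each operator, variable, and constant.

  (defun symbol-frequencies (trees)
    "Return a hash table mapping each operator, variable, or constant to its usage count."
    (let ((counts (make-hash-table :test #'equal)))
      (labels ((walk (tree)
                 (if (atom tree)
                     (incf (gethash tree counts 0))                 ; variable or constant leaf
                     (progn (incf (gethash (first tree) counts 0))  ; operator at an interior node
                            (mapc #'walk (rest tree))))))
        (mapc #'walk trees))
      counts))

  ;; Usage (top-10-solutions is a hypothetical list of trees):
  ;; (maphash (lambda (k v) (format t "~a was used ~a times~%" k v))
  ;;          (symbol-frequencies top-10-solutions))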

14
4. Experimental Results
  • Experiment 1 - 5-variable problem with 30 points of training data
  • Experiment 2 - 5-variable problem with 150 points of training data
  • Experiment 3 - Simple Linear Regression
  • Experiment 4 - Time for a Hot Object to Cool
  • Experiment 5 - Time for a Hot Object to Cool Using Constructive Induction
  • Experiment 6 - Glass Classification
  • Experiment 7 - Learning an Approximation Function

15
Experiment 7 - Learning An Approximation Function
  • f(a, b, c, d, e) = b - d - 0.44 log(a+1) + ab(1 - log(2))
  • Random values were plugged into function f in order to generate test cases of the form (a, b, c, d, e, f(a, b, c, d, e)) (sketched after this list).
  • Training data set: 150 test cases; testing data set: 150 test cases.
  • Goal was to find a function h(a, b, c, d, e) that
    best approximates function f.
  • Function set: +, *, loga(x) = log(x), cosa(x) = cos(x), sqrta(x) = sqrt(x), por(a,b) = a + b - ab, diva(a,b) = a/b, aminus(a,b) = a - b, expa(x) = e^min(x, 300), sin100a(x) = sin(100x)
  • Variable set: a, b, c, d, e
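A sketch of the test-case generation described above, assuming the reconstructed reading of f and uniform random inputs in [0,1) (the sampling range is an assumption; the slides do not state it):

  (defun f (a b c d e)
    "Reconstructed target function; c and e do not appear in it."
    (declare (ignore c e))
    (+ (- b d (* 0.44 (log (+ a 1))))
       (* a b (- 1 (log 2)))))

  (defun make-test-cases (n)
    "Generate N cases of the form (a b c d e f(a,b,c,d,e))."
    (loop repeat n
          collect (let ((args (loop repeat 5 collect (random 1.0))))
                    (append args (list (apply #'f args))))))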

16
Experiment 7 - continued
  • Fitness Function: Average Absolute Manhattan Distance, E = (1/n) Σ |xi - yi|, where
  • n = total number of test cases
  • xi = output generated by the solution found through genetic programming
  • yi = output generated by the known function (desired output)
  • RUN 1 - Direct Approach
  • RUN 2 - Using Leaf Building Blocks
  • RUN 3 - Using Leaf Building Blocks and Reduced
    Operator and Variable Sets
  • For each run a population size of 200 was used.
  • Each run was terminated after 100 generations.

17
RUN 1 - Direct Approach
  • Average Absolute Manhattan Distance Error
  • Training: 0.01102
  • Testing: 0.01332
  • Leaf Building Blocks in the Top 10 of the Population:
  • Pattern (* B A) occurred 25 times in 100.0% of the solutions.
  • Pattern (AMINUS 0.41999999999999998 D) occurred 32 times in 95.0% of the solutions.
  • Pattern (EXPA (AMINUS 0.41999999999999998 D)) occurred 28 times in 95.0% of the solutions.
  • Frequency Analysis in the Top 10 of the Population:
  • + was used 9 times in 25.0% of the population
  • aminus was used 56 times in 100.0% of the population
  • * was used 176 times in 100.0% of the population
  • / was used 75 times in 100.0% of the population
  • por was used 111 times in 100.0% of the population
  • sqrta was used 63 times in 100.0% of the population

18
  • Frequency Analysis (continued)
  • sin100a was used 3 times in 10.0% of the population
  • cosa was used 1 time in 5.0% of the population
  • loga was used 2 times in 5.0% of the population
  • expa was used 33 times in 100.0% of the population
  • A was used 76 times in 100.0% of the population
  • B was used 132 times in 100.0% of the population
  • C was used 9 times in 35.0% of the population
  • D was used 57 times in 100.0% of the population
  • E was used 21 times in 40.0% of the population
  • CONSTANT was used 152 times in 100.0% of the population
  • Leaf Building Blocks: (* B A), (AMINUS 0.41999999999999998 D), and (EXPA (AMINUS 0.41999999999999998 D))
  • Since variables C and E are the only two variables not present in 100% of the solutions analyzed, they will be eliminated in RUN 3 along with the least-used operator, cosa. Also notice that variables C and E were not actually present in function f and were successfully filtered out.

19
RUN 2 - Using Leaf Building Blocks
  • Solution: a large expression tree built from ADF1, ADF2, A, B, numeric constants, and the operators SQRTA, AMINUS, POR, and EXPA (only a fragment survives in the transcript).
  • Average Absolute Manhattan Distance Error
  • Training: 0.00886
  • Testing: 0.01056
  • In RUN 2 there is a 19.6% increase in accuracy in training and a 20.7% improvement in testing over RUN 1.

20
Best Solution of RUN 3
  • Solution: a large expression tree built from ADF1, ADF2, ADF3, A, B, numeric constants, and the operators DIVA, EXPA, LOGA, AMINUS, and POR (only a fragment survives in the transcript).
  • Where ADF1 = (* B A)

21
RUN 3 - Using Leaf Building Blocks and Reduced
Operator and Variable Sets
  • Solution: the same expression tree as shown on the previous slide, with ADF1 = (* B A).
  • Average Absolute Manhattan Distance Error
  • Training: 0.00825
  • Testing: 0.00946
  • In RUN 3 there is a 7.0% increase in accuracy in training and a 10.4% improvement in testing over RUN 2, and a 25.1% increase in accuracy in training and a 29.0% improvement in testing over RUN 1.

22
5. Summary and Conclusion
  • The WOLS system is a tool whose objective is to find good approximations for an unknown function f, and to analyze the decomposition of good approximations with the goal of identifying building blocks.
  • Our experiments demonstrated that significant improvements in the predictive accuracy of the obtained function approximations can be achieved by employing building block analysis and constructive induction. However, more experiments are needed to demonstrate the usefulness of the WOLS system.
  • The symbolic regression/genetic programming
    front-end (SRGP) did quite well for a number of
    regression system benchmarks. This interesting
    observation has to be investigated in more detail
    in the future.
  • The WOLS system was written in GCL Common LISP Version 2.2. All experiments were run on a 266 MHz Intel Pentium II with 96 MB RAM.

23
Future Work
  • Improve the WOLS system's performance in the area of classification problems.
  • Conduct more experiments that use the WOLS
    system as a secondary learning system, whose goal
    is to search for good building blocks useful for
    improving the attribute space for other inductive
    learning methods such as C4.5.
  • Construct a better user interface, such as a
    graphical user interface.
  • Although the empirical results obtained from the large number of experiments involving regression were encouraging, more work still needs to be done in this area as well.

24
Experiment 3 - Simple Linear Regression
  • In this experiment the WOLS system, NLREG, and
    Matlab were all used to obtain an approximation
    function for a simple linear regression problem.
  • NLREG, like most conventional regression analysis packages, is only capable of finding the numeric coefficients for a function whose form (i.e., linear, quadratic, or polynomial) has been prespecified by the user. A poor choice by the user of the function's form will in most cases lead to a very poor solution that is not truly representative of the program's ability to solve the problem. To avoid this problem, both the data set and the actual functional form from one of the example problems provided with this software were used for this experiment.
  • In Matlab there are basically two different approaches available to perform regression analysis. The quickest method, which was used in this experiment, is the command polyfit(x, y, n), which instructs Matlab to fit an nth-order polynomial to the data. The second method is similar to NLREG and requires the functional form to be specified by the user.
  • Training data set: 30 test cases; testing data set: 15 test cases.
  • The following parameters were used for the WOLS system:
  • Population Size: 500
  • Generations: 200
  • Function set: +, *, loga(x) = log(x), cosa(x) = cos(x), sqrta(x) = sqrt(x), por(a,b) = a + b - ab, diva(a,b) = a/b, aminus(a,b) = a - b, expa(x) = e^min(x, 300), sin100a(x) = sin(100x)
  • Variable set: a

25
  • NLREG Solution:
  • Title "Linear equation y = p0 + p1*x"
  • Variables x,y          // Two variables x and y
  • Parameters p0,p1       // Two parameters to be estimated: p0 and p1
  • Function y = p0 + p1*x // Model of the function
  • Stopped due to: Both parameter and relative function convergence.
  • Proportion of variance explained (R^2) = 0.9708 (97.08%)
  • Adjusted coefficient of multiple determination (Ra^2) = 0.9698 (96.98%)
  ---- Descriptive Statistics for Variables ----
  Variable     Minimum value   Maximum value   Mean value   Standard dev.
  ----------   -------------   -------------   ----------   -------------
  x            1               8               4.538095     2.120468
  y            6.3574          22.5456         14.40188     4.390238
  ---- Calculated Parameter Values ----

26
  • Matlab Solution:
  • 1st-order polynomial (n = 1):
  • y = 2.0400x + 5.1442
  • Training: 0.6361
  • Testing: 0.7438
  • 5th-order polynomial (n = 5):
  • y = 0.0070x^5 - 0.1580x^4 + 1.3365x^3 - 5.2375x^2 + 11.5018x - 1.0632
  • Training: 0.5838
  • Testing: 0.8034
  • 10th-order polynomial (n = 10):
  • y = 0.0004x^10 - 0.0187x^9 + 0.3439x^8 - 3.5856x^7 + 23.2763x^6 - 97.5702x^5 + 265.0925x^4 - 456.1580x^3 + 469.7099x^2 - 254.6558x + 59.9535
  • Training: 0.4700
  • Testing: 0.8669
  • In viewing the results above, one of the first things noticed is that as the order of the polynomial increases, the training performance also increases, but the testing performance decreases. This is because the higher-order polynomials tend to emulate the training set data too closely, which in turn causes poorer performance when the testing data set values are applied.

27
WOLS Solution
  • Solution: a large expression tree over the variable X and numeric constants, built from the operators POR, DIVA, LOGA, COSA, and EXPA along with arithmetic operators (only a fragment survives in the transcript).
  • Average Absolute Manhattan Distance
  • Training: 0.3788
  • Testing: 0.6906
  • Runtime-wise the WOLS system cannot compete with NLREG or Matlab, each of which found a solution in around 1 second, while the WOLS system spent nearly 4 hours generating its solution.
  • The WOLS system produced a solution that had 40% better training performance and 7% better testing performance than both the NLREG and Matlab solutions.
  • This solution was obtained without using building
    block analysis and constructive induction.