Title: Toward Uniform-distribution Monotone DNF Learning: A Proposed Algorithm
1 Toward Uniform-distribution Monotone DNF Learning: A Proposed Algorithm
- Presented by Wenzhu Bi
- Department of Mathematics and Computer Science
- Duquesne University
- Advisor: Dr. Jeffrey Jackson
2 Outline
- Some definitions
- Why we are working on DNF learning
- The proposed problem
- Previous work
- Our work and the algorithm
- Performed tests and analysis
- Future work
3 Some Definitions
- AND
- The Boolean function is true if and only if all the literals are true. For example, p∧q is true if and only if p and q are both true.
- OR
- The Boolean function is false only if all the literals are false. For example, p∨q is false if and only if p and q are both false.
- Negation
- If q is true, then ¬q is false; if q is false, then ¬q is true.
4 More Definitions
- DNF (Disjunctive Normal Form)
- An OR of ANDs. Conjunctions in a DNF are called terms. The size of a DNF is defined as the number of terms it has.
- For example, x1x2x3 ∨ x4x5 is a 2-term DNF.
- CNF (Conjunctive Normal Form)
- An AND of ORs.
- For example, (x1)(x2)(x3)(x4∨x5∨x6).
- Monotone DNF
- A DNF with no negated variables.
- For example, x1x2x3 ∨ x4x5.
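As a quick illustration (the representation and names here are our own, not from the slides), a monotone DNF can be stored as a list of terms, each term a set of variable indices; the formula is true exactly when some term has all of its variables set to 1:

```python
# Minimal sketch: evaluate a monotone DNF given as a list of terms.
# Each term is a set of variable indices; x maps index -> 0/1 value.
# Example: the 2-term DNF x1x2x3 OR x4x5 is [{1, 2, 3}, {4, 5}].

def eval_monotone_dnf(terms, x):
    """Return True iff some term has all of its variables set to 1."""
    return any(all(x[i] for i in term) for term in terms)

f = [{1, 2, 3}, {4, 5}]          # the 2-term DNF x1x2x3 OR x4x5
x = {1: 1, 2: 1, 3: 1, 4: 0, 5: 0}
print(eval_monotone_dnf(f, x))   # True: the first term is satisfied
```

Because no variable is negated, flipping any input bit from 0 to 1 can only turn the function from false to true, which is exactly the monotonicity property.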
5 Why are we interested in DNF Learning?
6 DNF is a natural means of representing many "expert" rules
- If it is raining, OR (the forecast calls for rain AND you trust weather forecasts), OR (you like to always be prepared AND you are taking a briefcase AND your briefcase is not full), then take an umbrella when you go out.
7 A more practical area: digital circuit design and implementation
8 The Proposed Problem
- In 1984, Valiant introduced the distribution-independent model of Probably Approximately Correct (PAC) learning from random examples and posed the problem of whether polynomial-size DNF functions are PAC learnable in polynomial time. For about twenty years, the DNF learning problem has been widely regarded as one of the most important, and challenging, open questions in Computational Learning Theory.
9 General DNF Learning
- It has been shown by Dr. Jackson that general DNF is strongly learnable with respect to the uniform distribution using membership queries.
10 Previous Work
- Because of the lack of progress on learning monotone DNF in the distribution-independent setting, instead of approaching this result directly, many researchers study restricted versions of the problem, such as learning monotone DNF from uniformly-distributed examples.
- Notice that the problem is restricted in two ways:
- 1. General DNF → Monotone DNF
- 2. Arbitrary Distribution → Uniform Distribution
11 Previous Work (Cont.)
- Hancock and Mansour gave a polynomial-time algorithm for learning monotone read-k DNF (DNF in which every variable appears at most k times, where k is a constant) under a family of distributions (including the uniform distribution).
- Then there was a series of research work, done by Verbeurgt, Kucera, Sakai, Maruoka, Bshouty and Tamon, improving monotone DNF learning under the uniform distribution.
- The latest best result is due to Servedio, who proved that the class of monotone 2^O(√(log n))-term DNF formulas can be PAC learned in polynomial time under the uniform distribution from random examples only.
12 Our Work
- Our algorithm attempts to learn a monotone DNF f using a hypothesis h (h is not necessarily a monotone DNF) in polynomial time from a set of uniformly-distributed examples generated by f.
13 The Algorithm
- Input
- Target function f
- f is a monotone DNF function.
- Output
- A hypothesis h such that Pr[f ≠ h] ≤ ε.
- In our particular case, h is a threshold function h = sign(F).
- F is a sum of parity functions.
14 Definitions in the Algorithm
- Threshold Function
- h = sign(F)
- h is 1 if F > 0, while h is -1 if F < 0.
15 More Definitions in the Algorithm
- Parity function χa(x)
- The function has the value 1 when the number of 1s in x indexed by a is even, and the value -1 when the number of 1s in x indexed by a is odd.
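This definition is short enough to state directly in code (a sketch in our own notation, treating a and x as bit masks):

```python
# Sketch of the parity function chi_a(x) over {0,1}^n, in the +1/-1 range.
# a and x are integers used as bit masks; a selects which bits of x count.

def parity(a, x):
    """Return +1 if x has an even number of 1s in the positions indexed
    by a, and -1 if that count is odd."""
    return 1 if bin(a & x).count("1") % 2 == 0 else -1

print(parity(0b011, 0b010))  # one selected bit is set -> -1
print(parity(0b011, 0b011))  # two selected bits are set -> +1
```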
16 The Hypothesis
- h = sign(F), in which F(x) = Σ over a∈S of wa·χa(x)
- S is a polynomial-size set of bit vectors
- wa is an integer called the Fourier coefficient of F.
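Putting the two definitions together (a sketch with our own names; `weights` maps each bit vector in S to its integer coefficient), the hypothesis is the sign of a weighted sum of parities:

```python
def parity(a, x):
    """chi_a(x): +1 if popcount(a & x) is even, else -1."""
    return 1 if bin(a & x).count("1") % 2 == 0 else -1

def F(weights, x):
    """F(x) = sum over a in S of w_a * chi_a(x); weights maps a -> w_a."""
    return sum(w * parity(a, x) for a, w in weights.items())

def h(weights, x):
    """Threshold hypothesis h = sign(F)."""
    return 1 if F(weights, x) > 0 else -1

# The algorithm's initial hypothesis: F = 1 (the constant parity only).
weights = {0b000: 1}
print(h(weights, 0b101))  # F = 1 > 0, so h = 1
```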
17 The Algorithm
- 1. Initialize F = 1, so h = sign(F) = 1. Check whether Pr[f ≠ h] ≤ ε. If yes, the algorithm stops; otherwise, continue to the next step.
- 2. Find the neighbors of the current parities, or add to the weights of existing parities, to produce the potential hypotheses.
18 How to choose the best hypothesis
- If h = sign(F) is a perfect hypothesis (i.e., for all x, h(x) = f(x)), then E[f·F] = E[|F|], since f(x)·F(x) = |F(x)| at every point.
- In the algorithm we build up the hypothesis by choosing a new parity in every loop. In each stage, we select the hypothesis with the new parity which minimizes the difference between the two sides of the above equation.
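The identity is easy to confirm numerically; here is a small sketch (our own construction, with f taken to be the AND of 3 bits in the +1/-1 range) showing that when sign(F) agrees with f everywhere, E[f·F] equals E[|F|]:

```python
# Verify E[f*F] = E[|F|] for a perfect hypothesis on n = 3 variables.
# f is the AND of the three bits (in the +1/-1 range); F is any function
# whose sign equals f, e.g. F(x) = 7 when x = 111 and -1 otherwise.

n = 3
f = [1 if x == 0b111 else -1 for x in range(2 ** n)]
F = [7 if x == 0b111 else -1 for x in range(2 ** n)]

lhs = sum(f[x] * F[x] for x in range(2 ** n)) / 2 ** n   # E[f*F]
rhs = sum(abs(F[x]) for x in range(2 ** n)) / 2 ** n     # E[|F|]
print(lhs == rhs)  # True: the two sides match exactly
```

The gap E[|F|] - E[f·F] equals 2·E[|F|·1{sign(F) ≠ f}], so it is zero exactly for a perfect hypothesis, which is why minimizing this difference is a sensible greedy criterion.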
19 The Algorithm (Continued)
- Step 2 continued
- Then pick the hypothesis which minimizes the difference. Then check whether Pr[f ≠ h] ≤ ε. If yes, the algorithm stops; otherwise, repeat Step 2.
20 An example
21 FFT (Butterfly): Fast Fourier Transform
22
- Each Fourier coefficient requires O(2^n) time to compute exactly. We can compute all of the coefficients in O(n·2^n) time using the FFT.
- In our tests, we perform this exact computation so that we can leave out the variability added by estimating the Fourier coefficients. In an actual implementation, these coefficients would be estimated using a polynomial-size set of examples (sufficient for PAC learning by the Chernoff bound).
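The butterfly computation referred to above is the Walsh-Hadamard transform; a compact in-place sketch (standard textbook form, not the authors' code) that computes all 2^n correlation sums in O(n·2^n) operations:

```python
# In-place fast Walsh-Hadamard (butterfly) transform over {0,1}^n.
# Input: vals[x] = f(x) for x = 0 .. 2^n - 1 (f in the +1/-1 range).
# Output: vals[a] = sum over x of f(x) * chi_a(x); dividing by 2^n
# gives the Fourier coefficient f_hat(a) = E[f * chi_a].

def wht(vals):
    n = len(vals)
    step = 1
    while step < n:
        for i in range(0, n, 2 * step):
            for j in range(i, i + step):
                a, b = vals[j], vals[j + step]
                vals[j], vals[j + step] = a + b, a - b  # butterfly
        step *= 2
    return vals

# f = AND of 3 bits, so f(x) = 1 only at x = 111 (index 7).
coeffs = wht([1 if x == 7 else -1 for x in range(8)])
print([c / 8 for c in coeffs])  # f_hat(000) = -0.75, as in the example
```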
23 Step 1: F = 1
- Then h = sign(F) = sign(1) = 1.
- And S = {000}, the weight w000 = 1, and the Fourier coefficient f̂(000) = E[f] = -0.75.
- Checking all the examples over the truth table, Pr[f ≠ h] = 0.875, which is greater than ε (suppose ε is 0.05). So we continue to Step 2.
24 Step 2, loop 1
- Current hypothesis: h = sign(F) with F = 1 and S = {000}.
- The neighbors of 000 are 001, 010, and 100.
- So the potential hypotheses will be:
- F = 3
- F = 1 - 2χ001(x)
- F = 1 - 2χ010(x)
- F = 1 - 2χ100(x)
25 The hypothesis pool
26 Choose the hypothesis h = sign(F) = sign(1 - 2χ001(x))
- With S = {000, 001}.
- Checking all the examples over the truth table, Pr[f ≠ h] = 0.375, which is greater than ε (suppose ε is 0.05). So we repeat Step 2.
27 Step 2, loop 2
- Current hypothesis: h = sign(F) = sign(1 - 2χ001(x)) with S = {000, 001}.
- The neighbors of 000 are 001, 010, and 100; the neighbors of 001 are 000, 011, and 101.
- Since 000 and 001 are already in the current hypothesis, the new neighbors pool will not include these two parities.
- So the current neighbors pool is temporarily {010, 100, 011, 101}.
28 Step 2, loop 2 (cont.)
- Here we add another rule: for any parity a in F (except the constant parity), if a′ is a subset of a, then the weight of a′ must be greater than or equal to the weight of a.
- This also means that a new parity can be added to the pool only if all of its immediate down-neighbors are already in the pool.
- So the new neighbors pool S′ will be {010, 100}. Then the potential hypotheses will be:
- F = 3 - 2χ001(x)
- F = 1 - 4χ001(x)
- F = 1 - 2χ001(x) - 2χ010(x)
- F = 1 - 2χ001(x) - 2χ100(x)
29 The hypothesis pool
30 Choose the hypothesis h = sign(F) = sign(1 - 2χ001(x) - 2χ010(x))
- With S = {000, 001, 010}.
- Checking all the examples over the truth table, Pr[f ≠ h] = 0.375, which is greater than ε (suppose ε is 0.05). So we repeat Step 2.
31 Step 2, loop 3
- Current hypothesis: h = sign(F) = sign(1 - 2χ001(x) - 2χ010(x)) with S = {000, 001, 010}.
- The neighbors of 000 are 001, 010, and 100; the neighbors of 001 are 000, 011, and 101; the neighbors of 010 are 000, 011, and 110.
- Since 000, 001 and 010 are already in the current hypothesis, the new neighbors pool will not include these three parities.
- So the current neighbors pool is temporarily {100, 011, 101, 110}.
32 Step 2, loop 3 (cont.)
- By the subsets rule, the new neighbors pool S′ will be {100, 011}. Then the potential hypotheses will be:
- F = 3 - 2χ001(x) - 2χ010(x)
- F = 1 - 4χ001(x) - 2χ010(x)
- F = 1 - 2χ001(x) - 4χ010(x)
- F = 1 - 2χ001(x) - 2χ010(x) + 2χ100(x)
- F = 1 - 2χ001(x) - 2χ010(x) + 2χ011(x)
33 The hypothesis pool
34 Choose the hypothesis h = sign(F) = sign(1 - 2χ001(x) - 2χ010(x) + 2χ011(x))
- With S = {000, 001, 010, 011}.
- Checking all the examples over the truth table, Pr[f ≠ h] = 0.125, which is greater than ε (suppose ε is 0.05). So we repeat Step 2.
35 Then repeat Step 2
- Finally, in loop 7, we find the hypothesis
- h = sign(1 - 2χ001(x) - 2χ010(x) + 2χ011(x) - 2χ100(x) + 2χ101(x) + 2χ110(x) - 2χ111(x)).
- Checking all the examples over the truth table, Pr[f ≠ h] = 0, which is less than ε (suppose ε is 0.05).
- So the algorithm ends, having found the final best hypothesis h.
36 Performed Tests and Analysis
37 How to analyze the results
- 1. Each parity in the hypothesis should be a subset of some term in the target function. If some parity is not a subset of any term in the target function, this is a sign that the algorithm is breaking down; otherwise, it is working as we expect.
38 How to analyze the results (cont.)
- 2. Check whether the error values keep decreasing.
- The error values may bounce up and down, but the minimum error values are expected to decrease. When the minimum error is smaller than ε, the algorithm stops and finds the final best hypothesis.
39 Case 1: Simple Test Cases
- f = x1x2x3
- f = x1x2 ∨ x3
- f = x1x2 ∨ x3x4
- f = x1x2 ∨ x3x5 ∨ x4x1 ∨ x2x3
- f = x1x2x3x4 ∨ x5x6x7x8
- f = x1x2x3 ∨ x4x5x6 ∨ x7x8x9
- f = x1x2x3 ∨ x4x5x6 ∨ x7x8x9 ∨ x10x11x12
- f = x1x2x3 ∨ x4x5x6 ∨ x7x8x9 ∨ x10x11x12 ∨ x13x14x15
- f = x1x2x3x4 ∨ x6x5x7x8 ∨ x9x10x11x12 ∨ x13x14x15x16
- f = x15x2x3x4 ∨ x6x5x7x8 ∨ x9x10x11x12 ∨ x13x14x1x16
- f = x1x2x3x4 ∨ x5x6x7x8 ∨ x9x10x11x12 ∨ x13x14x15 ∨ x16x17x18
- f = x1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19
- f = x1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20
40 f = x1x2x3x4 ∨ x6x5x7x8 ∨ x9x10x11x12 ∨ x13x14x15x16
41 Case 2: Read-once monotone DNF
- A read-once monotone DNF here has u terms, and each term has log(u) distinct variables.
- We tested x1x2 ∨ x3x4 ∨ x5x7 ∨ x6x8.
- The algorithm found the best hypothesis easily.
42 Case 3: Random monotone DNF
- Drs. Jackson and Servedio recently gave a proof that n-term random monotone DNF with log(n) literals in each term is learnable in polynomial time from uniform random examples with high probability.
- We tested cases with n²-term random monotone DNF with 2·log(n) literals in each term, because Jackson and Servedio's result does not extend to this case. We tried two cases, with n = 8 and n = 16.
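A sketch of how such a test function might be generated (our own code, using the term and literal counts from this case): n² terms over n variables, each term formed by sampling 2·log₂(n) distinct variables uniformly at random:

```python
import random, math

def random_monotone_dnf(n, seed=0):
    """Generate n^2 terms over variables 1..n, each term a set of
    2*log2(n) distinct variable indices chosen uniformly at random."""
    rng = random.Random(seed)
    k = 2 * int(math.log2(n))            # literals per term
    return [set(rng.sample(range(1, n + 1), k)) for _ in range(n * n)]

f = random_monotone_dnf(8)               # 64 terms, 6 literals each
print(len(f), len(f[0]))
```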
43 Case 3 Results
- When n = 8, f has 8 variables, 64 terms and 6 literals in each term. We tested with the following function:
- x1x6x3x5x7x2 ∨ x8x6x7x4x1x5 ∨ x4x2x8x6x5x7 ∨ x6x4x5x3x7x1 ∨
- x4x1x2x3x5x8 ∨ x8x3x5x6x7x2 ∨ x1x3x6x5x2x8 ∨ x1x4x8x6x7x3 ∨
- x4x7x5x2x6x1 ∨ x3x2x6x1x8x5 ∨ x8x2x3x5x1x4 ∨ x2x8x7x4x3x1 ∨
- x4x2x6x5x8x1 ∨ x6x1x8x7x2x3 ∨ x1x4x6x3x8x7 ∨ x2x1x4x3x7x5 ∨
- x8x3x5x2x7x4 ∨ x1x5x3x4x7x8 ∨ x3x1x4x5x8x2 ∨ x1x3x4x2x7x5 ∨
- x3x6x5x4x8x2 ∨ x3x5x2x6x1x7 ∨ x3x7x5x2x6x8 ∨ x5x6x7x3x2x8 ∨
- x2x6x3x5x8x1 ∨ x1x4x5x2x3x8 ∨ x1x7x4x2x6x3 ∨ x2x6x1x4x8x7 ∨
- x8x7x3x5x4x1 ∨ x5x4x2x1x3x8 ∨ x1x8x6x7x2x5 ∨ x6x3x2x5x8x1 ∨
- x7x1x3x8x6x5 ∨ x8x3x6x5x2x1 ∨ x3x6x7x8x4x1 ∨ x7x5x1x2x3x4 ∨
- x8x3x1x5x6x4 ∨ x8x5x1x4x7x3 ∨ x5x7x2x4x6x8 ∨ x3x4x5x8x2x7 ∨
- x2x4x6x3x5x7 ∨ x5x6x8x3x4x7 ∨ x4x7x2x8x3x5 ∨ x5x8x1x2x4x6 ∨
- x4x8x5x3x6x1 ∨ x6x5x2x8x3x4 ∨ x8x5x2x3x4x7 ∨ x6x7x3x4x5x1 ∨
- x5x4x1x2x7x8 ∨ x8x5x1x6x3x2 ∨ x5x6x3x1x4x8 ∨ x4x6x1x3x5x7 ∨
- x4x1x6x2x5x8 ∨ x8x4x1x6x5x3 ∨ x8x3x4x5x7x6 ∨ x2x4x8x3x1x5 ∨
- x4x8x7x2x1x3 ∨ x4x2x3x8x6x7 ∨ x3x7x5x4x2x6 ∨ x3x2x6x1x7x4 ∨
- x3x7x4x1x5x6 ∨ x2x6x7x5x8x4 ∨ x4x5x3x2x8x6 ∨ x6x8x2x7x5x3
44 The hypothesis h = sign(F)
45 Case 3 Results
- When n = 16, f has 16 variables, 256 terms and 8 literals in each term. We tested with the following function.
- This case was not finished. The result so far is that the minimum error is 0.111252, reached in loop 2349 with a difference value of 6.38239. Before that, the minimum error stayed at 0.112366 from loop 1363 to loop 2238, and then it changed to 0.112183, then to 0.111252. Since the difference value is still small and the minimum error is still getting smaller, it might be reasonable to say that this case may finish with the expected results. (This case had run for several days and was then stopped because the computer was restarted by other people. We will try to test this case again if we have time.)
46 Case 4: Monotone DNF with significant sharing of variables between terms
- We have tested a case like this: whenever x1 is selected for a term, x2 and x3 will also be in the term with probability 0.5 each.
- That is, whenever x1 is selected for a term:
- With probability 0.25, both x2 and x3 will be in the term;
- With probability 0.25, only x2 will be in the term;
- With probability 0.25, only x3 will be in the term;
- With probability 0.25, neither x2 nor x3 will be in the term.
47 Case 4 (Cont.)
- This means that x2 and x3 have equal probability of appearing in the function. This might confuse the algorithm when it faces the question of whether x2 or x3 should be chosen.
- We tested the case with 2 examples:
- (a) One example has 8 variables, 64 terms and 6 literals in each term;
- (b) The other example has 10 variables, 64 terms and 6 literals in each term.
- Both cases performed well and produced the expected results.
48 Case 5
- Dr. Servedio recommended this case:
- (x1x2 ∨ x3x4)(x5x6 ∨ x7x8) ∨ (x9x10 ∨ x11x12)(x13x14 ∨ x15x16)
- which is an AND-OR tree, i.e. a binary tree where odd layers all have AND and even layers all have OR.
49 Case 5 Results
- If it is converted to a monotone DNF function, it is as follows:
- x1x2x5x6 ∨ x1x2x7x8 ∨ x3x4x5x6 ∨ x3x4x7x8 ∨ x9x10x13x14 ∨ x9x10x15x16 ∨ x11x12x13x14 ∨ x11x12x15x16
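This expansion is easy to machine-check; a small sketch (our own) confirms that the 8-term DNF above agrees with the AND-OR tree on all 2^16 assignments:

```python
from itertools import product

# Case 5 target as an AND-OR tree and as its expanded 8-term DNF.
def tree(x):
    return ((x[1] and x[2] or x[3] and x[4]) and
            (x[5] and x[6] or x[7] and x[8]) or
            (x[9] and x[10] or x[11] and x[12]) and
            (x[13] and x[14] or x[15] and x[16]))

dnf_terms = [{1, 2, 5, 6}, {1, 2, 7, 8}, {3, 4, 5, 6}, {3, 4, 7, 8},
             {9, 10, 13, 14}, {9, 10, 15, 16},
             {11, 12, 13, 14}, {11, 12, 15, 16}]

def dnf(x):
    return any(all(x[i] for i in t) for t in dnf_terms)

ok = all(bool(tree(dict(enumerate(bits, 1)))) == dnf(dict(enumerate(bits, 1)))
         for bits in product([0, 1], repeat=16))
print(ok)  # True: the two forms agree on all 65536 inputs
```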
50 Case 5 Results
- For this case the algorithm didn't finish with the expected results.
- The minimum error was 0.134216, first reached in loop 135, and it stayed at 0.134216 until loop 2681. The difference values are not getting large, staying around 1 and 2; for loop 2681, the difference is 1.28302.
- Since it spent so many loops without reaching a smaller error value, and some parity functions are not subsets of the terms in the target function, it appears that the algorithm will not work out for this case.
51 Case 5 Analysis
- Since a DNF expression can also be represented by a CNF, it is not guaranteed that the algorithm will find the terms of the DNF rather than the clauses of the CNF. It was expected that this case of monotone DNF might be hard to learn, since the algorithm may find parities which correspond to clauses of the CNF.
52 Case 6: Monotone DNF with dependent terms
- 1st term: x1x7x4
- 2nd term: xaxbxc
- 3rd term: xdxexf
- Suppose α = 0.3.
- In the 2nd term, a is 1 with probability 0.3; a is chosen randomly from the other remaining variables with probability 0.7.
- The same rule applies for b, c, d, e.
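A sketch of this generation rule (our own reading of the slide; function and variable names are assumptions): each literal position of a later term reuses the corresponding literal of the first term with probability α, and is otherwise drawn at random from the variables not already in the term:

```python
import random

def dependent_term(first_term, n, alpha, rng):
    """Build a term over variables 1..n: each position reuses the
    corresponding variable of first_term with probability alpha, and
    otherwise picks a random variable not already used in this term."""
    term = []
    for v in first_term:
        if rng.random() < alpha and v not in term:
            pick = v
        else:
            pick = rng.choice([u for u in range(1, n + 1) if u not in term])
        term.append(pick)
    return term

rng = random.Random(1)
first = [1, 7, 4]                 # the slide's 1st term x1x7x4
print(dependent_term(first, 10, 0.3, rng))
```

With α = 1 every later term copies the first term exactly; with α = 0 the terms are independent, matching the two extremes described on the next slides.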
53 Case 6 (Cont.)
- Assume there is a parameter α (0 ≤ α ≤ 1) which is used to indicate how dependent the terms in the target function are.
- The larger the value of α, the more dependent the terms are.
54 Case 6 (Cont.)
- For this case, the value of α can be varied continuously from 0 to 1:
- 1. When α = 0, all the terms are generated independently of each other;
- 2. When 0 < α < 1, the terms in the target function are dependent with coefficient α.
55 Why do we test this case?
- Since the value of α can be varied continuously from 0 to 1, we can get a series of functions with different degrees of dependence between terms.
- If this series of functions can be learned, it means that the algorithm works whether the terms are dependent or independent. This result would be very encouraging.
56 Note
- To get valuable test results, we need to choose the test cases carefully. We prefer to choose a target function f with a probability of about 0.5 of being 1 and a probability of about 0.5 of being -1. If a target function satisfies these probability requirements, it is balanced; otherwise, it is biased.
- Suppose the variable β is the probability that the target function is -1; then the more skewed the target function, the further the value of β is from 0.5.
- When the function is more balanced, it is harder for the algorithm to learn.
57 Case 6 Results
- We generated a series of test cases; each function has 16 variables, 20 terms and 4 literals in each term. And:
- (a) Case 1. When α = 0, β = 0.465515
- (b) Case 2. When α = 0.25, β = 0.462097
- (c) Case 3. When α = 0.50, β = 0.516663
- (d) Case 4. When α = 0.75, β = 0.648193
- (e) Case 5. When α = 0.95, β = 0.875.
58 Case 6 Results (Cont.)
- The algorithm performs very well in finding the hypothesis for Case 1, Case 4 and Case 5 with ε = 0.05.
- But it seems to have problems finding the hypothesis for Case 2 and Case 3.
59 Case 6 Results (Cont.)
- The following is the data we got in the results:
- (a) For α = 0.25, from loop 82 to loop 5451 the minimum error stays at 0.269104; the difference values keep getting larger, reaching 409.171 in loop 5451.
- (b) For α = 0.5, from loop 51 to loop 7046 the minimum error stays at 0.209656; the difference values keep getting larger, reaching 360.967 in loop 7046.
60 Case 6 Results: Another series of functions
- Another series of similar target functions: each function has 16 variables, 256 terms and 6 literals in each term.
- (a) Case 1. When α = 0, β = 0.45665
- (b) Case 2. When α = 0.1, β = 0.454147
- (c) Case 3. When α = 0.2, β = 0.457825
- (d) Case 4. When α = 0.3, β = 0.45816
- (e) Case 5. When α = 0.4, β = 0.460587
- (f) Case 6. When α = 0.5, β = 0.460587
- (g) Case 7. When α = 0.6, β = 0.472488
- (h) Case 8. When α = 0.7, β = 0.497726
- (i) Case 9. When α = 0.8, β = 0.520004
- (j) Case 10. When α = 0.85, β = 0.551468
- (k) Case 11. When α = 0.9, β = 0.62793
- (l) Case 12. When α = 0.95, β = 0.719345.
61 Case 6 Results: Another series of functions
- (a) Case 1. When α = 0, β = 0.45665. The algorithm reaches a minimum error of 0.144882 in loop 33 and keeps this minimum error until loop 5185. From the result file, we can see that no new parities are added to the hypothesis after loop 47. Without new parities being added, only the weights of the old parities are increased.
62 Case (a) (cont.)
- In loop 5185, the selected hypothesis is:
63 Case (a) (cont.)
- The difference value for this picked hypothesis is 99.7091.
- For this case, since α = 0, the terms are supposed to be independent of each other; that is, it is a random monotone DNF function. But given the above results, it seems the algorithm may fail to find the expected hypothesis. For now, we suppose that although we thought it was a random monotone DNF, the terms in the function might be somehow dependent on each other. We still need to do some analysis on this case.
64 Case (b) Results
- (b) Case 2. When α = 0.1, β = 0.454147. The minimum error is 0.129028 until loop 9409. Since some files were deleted, I cannot find in which loop this minimum error value was reached; it was known from my notes that it was already reached by loop 6866. In loop 8117 the picked hypothesis has difference value 38.94, and in loop 9409 it has difference value 44.5602. It is also known that after loop 320, no new parity was added to the hypothesis.
65 Cases (c)-(h) Results
66 Only Case (l) Finished
- When α = 0.95, β = 0.719345, the algorithm works well on this case. It reaches the minimum error 0.053772 in loop 2357 with difference value 2.09869.
67 Case 6 Analysis
- We are beginning to think that maybe the current algorithm does not work correctly for some monotone DNF functions with dependent terms.
- We will discuss later, in the section "Future Work", some changes to the current algorithm to make it work for these cases.
68 Future Work
- Change the coefficient of the constant parity?
- When there are ties in the pool:
- a. try choosing randomly from the ties;
- b. try choosing the hypothesis with the smallest weight after the parity is added to the hypothesis or the weight of the current parity is increased.
69 Summary
- 1. We have some cases, some of which are huge functions, on which the algorithm performs very well. These results are encouraging.
- 2. We have problems with the case recommended by Dr. Servedio and with monotone DNFs with dependent terms. We still need to make some changes to the algorithm to see if it will work on these cases.
70 Acknowledgements
- Dr. Jeffrey Jackson
- Dr. Donald Simon
- Dr. Frank D'Amico
- Dr. Kathleen Taylor
- Other professors in the department
71 Thanks for coming!