Title: Toward Uniform-distribution Monotone DNF Learning: A Proposed Algorithm
1 Toward Uniform-distribution Monotone DNF Learning: A Proposed Algorithm
- Presented by Wenzhu Bi
- Department of Mathematics and Computer Science
- Duquesne University
- Advisor: Dr. Jeffrey Jackson
2 Outline
- Some definitions
- Why we are working on DNF learning
- The proposed problem
- Previous work
- Our work and the algorithm
- Performed tests and analysis
- Future work
3 Some Definitions
- AND
- The Boolean function is true if and only if all the literals are true. For example, p∧q is true if and only if p and q are both true.
- OR
- The Boolean function is false only if all the literals are false. For example, p∨q is false if and only if p and q are both false.
- Negation
- If q is true, then ¬q is false; if q is false, then ¬q is true.
4 More Definitions
- DNF (Disjunctive Normal Form)
- An OR of ANDs. Conjunctions in a DNF are called terms. The size of a DNF is defined as the number of terms it has.
- For example, x1x2x3 ∨ x4x5 is a 2-term DNF.
- CNF (Conjunctive Normal Form)
- An AND of ORs.
- For example, (x1)(x2)(x3)(x4∨x5∨x6).
- Monotone DNF
- A DNF with no negated variables.
- For example, x1x2x3 ∨ x4x5.
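As a quick illustration (the representation and names here are our own, not from the slides), a monotone DNF can be stored as a list of terms, each term a set of variable indices; the formula is true exactly when some term has all of its variables set to 1:

```python
# Minimal sketch: evaluate a monotone DNF given as a list of terms.
# Each term is a set of variable indices; x maps index -> 0/1 value.
# Example: the 2-term DNF x1x2x3 OR x4x5 is [{1, 2, 3}, {4, 5}].

def eval_monotone_dnf(terms, x):
    """Return True iff some term has all of its variables set to 1."""
    return any(all(x[i] for i in term) for term in terms)

f = [{1, 2, 3}, {4, 5}]          # the 2-term DNF x1x2x3 OR x4x5
x = {1: 1, 2: 1, 3: 1, 4: 0, 5: 0}
print(eval_monotone_dnf(f, x))   # True: the first term is satisfied
```

Because no variable is negated, flipping any input bit from 0 to 1 can only turn the function from false to true, which is exactly the monotonicity property.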
5 Why are we interested in DNF Learning?
6 DNF is a natural means of representing many "expert" rules
- If it is raining, OR (the forecast calls for rain AND you trust weather forecasts), OR (you like to always be prepared AND you are taking a briefcase AND your briefcase is not full), then take an umbrella when you go out.
7 A more practical area: digital circuit design and implementation
8 The Proposed Problem
- In 1984, Valiant introduced the distribution-independent model of Probably Approximately Correct (PAC) learning from random examples and posed the problem of whether polynomial-size DNF functions are PAC learnable in polynomial time. For about twenty years, the DNF learning problem has been widely regarded as one of the most important, and challenging, open questions in Computational Learning Theory.
9 General DNF Learning
- It has been shown by Dr. Jackson that general DNF is strongly learnable with respect to the uniform distribution using membership queries.
10 Previous Work
- Because of the lack of progress on learning monotone DNF in the distribution-independent setting, instead of approaching this result directly, many researchers study restricted versions of the problem, such as learning monotone DNF from uniformly-distributed examples.
- Notice that the problem is restricted in two ways:
- 1. General DNF → Monotone DNF
- 2. Arbitrary Distribution → Uniform Distribution
11 Previous Work (Cont.)
- Hancock and Mansour gave a polynomial-time algorithm for learning monotone read-k DNF (DNF in which every variable appears at most k times, where k is a constant) under a family of distributions (including the uniform distribution).
- Then there was a series of research work, done by Verbeurgt, Kucera, Sakai, Maruoka, Bshouty and Tamon, improving monotone DNF learning under the uniform distribution.
- The latest best result is due to Servedio, who proved that the class of monotone 2^O(√(log n))-term DNF formulas can be PAC learned in polynomial time under the uniform distribution from random examples only.
12 Our Work
- Our algorithm attempts to learn a monotone DNF f using a hypothesis h (h is not necessarily a monotone DNF) in polynomial time from a set of uniformly-distributed examples generated by f.
13 The Algorithm
- Input
- Target function f
- f is a monotone DNF function.
- Output
- A hypothesis h such that Pr[f ≠ h] ≤ ε.
- In our particular case, h is a threshold function h = sign(F).
- F is a sum of parity functions.
14 Definitions in the Algorithm
- Threshold Function
- h = sign(F)
- h is 1 if F > 0, while h is -1 if F < 0.
15 More Definitions in the Algorithm
- Parity function χa(x)
- The function has the value 1 when the number of 1s in x indexed by a is even, and the value -1 when the number of 1s in x indexed by a is odd.
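This definition is short enough to state directly in code (a sketch in our own notation, treating a and x as bit masks):

```python
# Sketch of the parity function chi_a(x) over {0,1}^n, in the +1/-1 range.
# a and x are integers used as bit masks; a selects which bits of x count.

def parity(a, x):
    """Return +1 if x has an even number of 1s in the positions indexed
    by a, and -1 if that count is odd."""
    return 1 if bin(a & x).count("1") % 2 == 0 else -1

print(parity(0b011, 0b010))  # one selected bit is set -> -1
print(parity(0b011, 0b011))  # two selected bits are set -> +1
```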
16 The Hypothesis
- h = sign(F), in which F(x) = Σ over a∈S of wa·χa(x)
- S is a polynomial-size set of bit vectors
- wa is an integer called the Fourier coefficient of F.
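Putting the two definitions together (a sketch with our own names; `weights` maps each bit vector in S to its integer coefficient), the hypothesis is the sign of a weighted sum of parities:

```python
def parity(a, x):
    """chi_a(x): +1 if popcount(a & x) is even, else -1."""
    return 1 if bin(a & x).count("1") % 2 == 0 else -1

def F(weights, x):
    """F(x) = sum over a in S of w_a * chi_a(x); weights maps a -> w_a."""
    return sum(w * parity(a, x) for a, w in weights.items())

def h(weights, x):
    """Threshold hypothesis h = sign(F)."""
    return 1 if F(weights, x) > 0 else -1

# The algorithm's initial hypothesis: F = 1 (the constant parity only).
weights = {0b000: 1}
print(h(weights, 0b101))  # F = 1 > 0, so h = 1
```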
17 The Algorithm
- 1. Initialize F = 1, so h = sign(F) = 1. Check whether Pr[f ≠ h] ≤ ε. If yes, the algorithm stops; otherwise, continue to the next step.
- 2. Find the neighbors of the current parities, or add to the weights of existing parities, to produce the potential hypotheses.
18 How to choose the best hypothesis
- If h = sign(F) is a perfect hypothesis (i.e., for all x, h(x) = f(x)), then E[f·F] = E[|F|], since f(x)·F(x) = |F(x)| at every point.
- In the algorithm we build up the hypothesis by choosing a new parity in every loop. In each stage, we select the hypothesis with the new parity which minimizes the difference between the two sides of the above equation.
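The identity is easy to confirm numerically; here is a small sketch (our own construction, with f taken to be the AND of 3 bits in the +1/-1 range) showing that when sign(F) agrees with f everywhere, E[f·F] equals E[|F|]:

```python
# Verify E[f*F] = E[|F|] for a perfect hypothesis on n = 3 variables.
# f is the AND of the three bits (in the +1/-1 range); F is any function
# whose sign equals f, e.g. F(x) = 7 when x = 111 and -1 otherwise.

n = 3
f = [1 if x == 0b111 else -1 for x in range(2 ** n)]
F = [7 if x == 0b111 else -1 for x in range(2 ** n)]

lhs = sum(f[x] * F[x] for x in range(2 ** n)) / 2 ** n   # E[f*F]
rhs = sum(abs(F[x]) for x in range(2 ** n)) / 2 ** n     # E[|F|]
print(lhs == rhs)  # True: the two sides match exactly
```

The gap E[|F|] - E[f·F] equals 2·E[|F|·1{sign(F) ≠ f}], so it is zero exactly for a perfect hypothesis, which is why minimizing this difference is a sensible greedy criterion.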
19 The Algorithm (Continued)
- Step 2 continued
- Then pick the hypothesis which minimizes the difference. Then check whether Pr[f ≠ h] ≤ ε. If yes, the algorithm stops; otherwise, repeat Step 2.
20 An example
21 FFT (Butterfly): Fast Fourier Transform
22
- Each Fourier coefficient requires O(2^n) time to compute exactly. We can compute all of the coefficients in O(n·2^n) time using the FFT.
- In our tests, we perform this exact computation so that we can leave out the variability added by estimating the Fourier coefficients. In an actual implementation, these coefficients would be estimated using a polynomial-size set of examples (sufficient for PAC learning by the Chernoff bound).
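The butterfly computation referred to above is the Walsh-Hadamard transform; a compact in-place sketch (standard textbook form, not the authors' code) that computes all 2^n correlation sums in O(n·2^n) operations:

```python
# In-place fast Walsh-Hadamard (butterfly) transform over {0,1}^n.
# Input: vals[x] = f(x) for x = 0 .. 2^n - 1 (f in the +1/-1 range).
# Output: vals[a] = sum over x of f(x) * chi_a(x); dividing by 2^n
# gives the Fourier coefficient f_hat(a) = E[f * chi_a].

def wht(vals):
    n = len(vals)
    step = 1
    while step < n:
        for i in range(0, n, 2 * step):
            for j in range(i, i + step):
                a, b = vals[j], vals[j + step]
                vals[j], vals[j + step] = a + b, a - b  # butterfly
        step *= 2
    return vals

# f = AND of 3 bits, so f(x) = 1 only at x = 111 (index 7).
coeffs = wht([1 if x == 7 else -1 for x in range(8)])
print([c / 8 for c in coeffs])  # f_hat(000) = -0.75, as in the example
```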
23 Step 1: F = 1
- Then h = sign(F) = sign(1) = 1.
- And S = {000}, the weight w000 = 1, and the Fourier coefficient f̂(000) = E[f] = -0.75.
- Checking all the examples over the truth table, Pr[f ≠ h] = 0.875, which is greater than ε (suppose ε is 0.05). So we continue to Step 2.
24 Step 2, loop 1
- Current hypothesis: h = sign(F) with F = 1 and S = {000}.
- The neighbors of 000 are 001, 010, and 100.
- So the potential hypotheses will be:
- F = 3
- F = 1 - 2χ001(x)
- F = 1 - 2χ010(x)
- F = 1 - 2χ100(x)
25 The hypothesis pool
26 Choose the hypothesis h = sign(F) = sign(1 - 2χ001(x))
- With S = {000, 001}.
- Checking all the examples over the truth table, Pr[f ≠ h] = 0.375, which is greater than ε (suppose ε is 0.05). So we repeat Step 2.
27 Step 2, loop 2
- Current hypothesis: h = sign(F) = sign(1 - 2χ001(x)) with S = {000, 001}.
- The neighbors of 000 are 001, 010, and 100; the neighbors of 001 are 000, 011, and 101.
- Since 000 and 001 are already in the current hypothesis, the new neighbors pool will not include these two parities.
- So the current neighbors pool is temporarily {010, 100, 011, 101}.
28 Step 2, loop 2 (cont.)
- Here we add another rule: for any parity a in F (except the constant parity), if a′ is a subset of a, then the weight of a′ must be greater than or equal to the weight of a.
- This also means that a new parity can be added to the pool only if all of its immediate down-neighbors are already in the pool.
- So the new neighbors pool S′ will be {010, 100}. Then the potential hypotheses will be:
- F = 3 - 2χ001(x)
- F = 1 - 4χ001(x)
- F = 1 - 2χ001(x) - 2χ010(x)
- F = 1 - 2χ001(x) - 2χ100(x)
29 The hypothesis pool
30 Choose the hypothesis h = sign(F) = sign(1 - 2χ001(x) - 2χ010(x))
- With S = {000, 001, 010}.
- Checking all the examples over the truth table, Pr[f ≠ h] = 0.375, which is greater than ε (suppose ε is 0.05). So we repeat Step 2.
31 Step 2, loop 3
- Current hypothesis: h = sign(F) = sign(1 - 2χ001(x) - 2χ010(x)) with S = {000, 001, 010}.
- The neighbors of 000 are 001, 010, and 100; the neighbors of 001 are 000, 011, and 101; the neighbors of 010 are 000, 011, and 110.
- Since 000, 001 and 010 are already in the current hypothesis, the new neighbors pool will not include these three parities.
- So the current neighbors pool is temporarily {100, 011, 101, 110}.
32 Step 2, loop 3 (cont.)
- By the subsets rule, the new neighbors pool S′ will be {100, 011}. Then the potential hypotheses will be:
- F = 3 - 2χ001(x) - 2χ010(x)
- F = 1 - 4χ001(x) - 2χ010(x)
- F = 1 - 2χ001(x) - 4χ010(x)
- F = 1 - 2χ001(x) - 2χ010(x) + 2χ100(x)
- F = 1 - 2χ001(x) - 2χ010(x) + 2χ011(x)
33 The hypothesis pool
34 Choose the hypothesis h = sign(F) = sign(1 - 2χ001(x) - 2χ010(x) + 2χ011(x))
- With S = {000, 001, 010, 011}.
- Checking all the examples over the truth table, Pr[f ≠ h] = 0.125, which is greater than ε (suppose ε is 0.05). So we repeat Step 2.
35 Then repeat Step 2
- Finally, in loop 7, we find the hypothesis
- h = sign(1 - 2χ001(x) - 2χ010(x) + 2χ011(x) - 2χ100(x) + 2χ101(x) + 2χ110(x) - 2χ111(x)).
- Checking all the examples over the truth table, Pr[f ≠ h] = 0, which is less than ε (suppose ε is 0.05).
- So the algorithm ends, having found the final best hypothesis h.
36 Performed Tests and Analysis
37 How to analyze the results
- 1. Each parity in the hypothesis should be a subset of some term in the target function. If some parity is not a subset of any term in the target function, this is a sign that the algorithm is breaking down; otherwise, it is working as we expect.
38 How to analyze the results (cont.)
- 2. Check whether the error values keep decreasing.
- The error values may bounce up and down, but the minimum error values are expected to decrease. When the minimum error is smaller than ε, the algorithm stops and finds the final best hypothesis.
39 Case 1: Simple Test Cases
- f = x1x2x3
- f = x1x2 ∨ x3
- f = x1x2 ∨ x3x4
- f = x1x2 ∨ x3x5 ∨ x4x1 ∨ x2x3
- f = x1x2x3x4 ∨ x5x6x7x8
- f = x1x2x3 ∨ x4x5x6 ∨ x7x8x9
- f = x1x2x3 ∨ x4x5x6 ∨ x7x8x9 ∨ x10x11x12
- f = x1x2x3 ∨ x4x5x6 ∨ x7x8x9 ∨ x10x11x12 ∨ x13x14x15
- f = x1x2x3x4 ∨ x6x5x7x8 ∨ x9x10x11x12 ∨ x13x14x15x16
- f = x15x2x3x4 ∨ x6x5x7x8 ∨ x9x10x11x12 ∨ x13x14x1x16
- f = x1x2x3x4 ∨ x5x6x7x8 ∨ x9x10x11x12 ∨ x13x14x15 ∨ x16x17x18
- f = x1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19
- f = x1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20
40 f = x1x2x3x4 ∨ x6x5x7x8 ∨ x9x10x11x12 ∨ x13x14x15x16
41 Case 2: Read-once monotone DNF
- A read-once monotone DNF here has u terms, and each term has log(u) distinct variables.
- We tested x1x2 ∨ x3x4 ∨ x5x7 ∨ x6x8.
- The algorithm found the best hypothesis easily.
42 Case 3: Random monotone DNF
- Drs. Jackson and Servedio recently gave a proof that n-term random monotone DNF with log(n) literals in each term is learnable in polynomial time from uniform random examples with high probability.
- We tested cases with n²-term random monotone DNF with 2·log(n) literals in each term, because Jackson and Servedio's result does not extend to this case. We tried two cases, with n = 8 and n = 16.
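A sketch of how such a test function might be generated (our own code, using the term and literal counts from this case): n² terms over n variables, each term formed by sampling 2·log₂(n) distinct variables uniformly at random:

```python
import random, math

def random_monotone_dnf(n, seed=0):
    """Generate n^2 terms over variables 1..n, each term a set of
    2*log2(n) distinct variable indices chosen uniformly at random."""
    rng = random.Random(seed)
    k = 2 * int(math.log2(n))            # literals per term
    return [set(rng.sample(range(1, n + 1), k)) for _ in range(n * n)]

f = random_monotone_dnf(8)               # 64 terms, 6 literals each
print(len(f), len(f[0]))
```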
43 Case 3 Results
- When n = 8, f has 8 variables, 64 terms and 6 literals in each term. We tested with the following function:
- x1x6x3x5x7x2 ∨ x8x6x7x4x1x5 ∨ x4x2x8x6x5x7 ∨ x6x4x5x3x7x1 ∨
- x4x1x2x3x5x8 ∨ x8x3x5x6x7x2 ∨ x1x3x6x5x2x8 ∨ x1x4x8x6x7x3 ∨
- x4x7x5x2x6x1 ∨ x3x2x6x1x8x5 ∨ x8x2x3x5x1x4 ∨ x2x8x7x4x3x1 ∨
- x4x2x6x5x8x1 ∨ x6x1x8x7x2x3 ∨ x1x4x6x3x8x7 ∨ x2x1x4x3x7x5 ∨
- x8x3x5x2x7x4 ∨ x1x5x3x4x7x8 ∨ x3x1x4x5x8x2 ∨ x1x3x4x2x7x5 ∨
- x3x6x5x4x8x2 ∨ x3x5x2x6x1x7 ∨ x3x7x5x2x6x8 ∨ x5x6x7x3x2x8 ∨
- x2x6x3x5x8x1 ∨ x1x4x5x2x3x8 ∨ x1x7x4x2x6x3 ∨ x2x6x1x4x8x7 ∨
- x8x7x3x5x4x1 ∨ x5x4x2x1x3x8 ∨ x1x8x6x7x2x5 ∨ x6x3x2x5x8x1 ∨
- x7x1x3x8x6x5 ∨ x8x3x6x5x2x1 ∨ x3x6x7x8x4x1 ∨ x7x5x1x2x3x4 ∨
- x8x3x1x5x6x4 ∨ x8x5x1x4x7x3 ∨ x5x7x2x4x6x8 ∨ x3x4x5x8x2x7 ∨
- x2x4x6x3x5x7 ∨ x5x6x8x3x4x7 ∨ x4x7x2x8x3x5 ∨ x5x8x1x2x4x6 ∨
- x4x8x5x3x6x1 ∨ x6x5x2x8x3x4 ∨ x8x5x2x3x4x7 ∨ x6x7x3x4x5x1 ∨
- x5x4x1x2x7x8 ∨ x8x5x1x6x3x2 ∨ x5x6x3x1x4x8 ∨ x4x6x1x3x5x7 ∨
- x4x1x6x2x5x8 ∨ x8x4x1x6x5x3 ∨ x8x3x4x5x7x6 ∨ x2x4x8x3x1x5 ∨
- x4x8x7x2x1x3 ∨ x4x2x3x8x6x7 ∨ x3x7x5x4x2x6 ∨ x3x2x6x1x7x4 ∨
- x3x7x4x1x5x6 ∨ x2x6x7x5x8x4 ∨ x4x5x3x2x8x6 ∨ x6x8x2x7x5x3
44 The hypothesis h = sign(F)
45 Case 3 Results
- When n = 16, f has 16 variables, 256 terms and 8 literals in each term. We tested with the following function.
- This case was not finished. The result so far is that the minimum error is 0.111252, reached in loop 2349 with a difference value of 6.38239. Before that, the minimum error stayed at 0.112366 from loop 1363 to loop 2238, and then it changed to 0.112183, then to 0.111252. Since the difference value is still small and the minimum error is still getting smaller, it might be reasonable to say that this case may finish with the expected results. (This case had run for several days and was then stopped because the computer was restarted by other people. We will try to test this case again if we have time.)
46 Case 4: Monotone DNF with significant sharing of variables between terms
- We have tested a case like this: whenever x1 is selected for a term, x2 and x3 will also be in the term with probability 0.5 each.
- That is, whenever x1 is selected for a term:
- With probability 0.25, both x2 and x3 will be in the term;
- With probability 0.25, only x2 will be in the term;
- With probability 0.25, only x3 will be in the term;
- With probability 0.25, neither x2 nor x3 will be in the term.
47 Case 4 (Cont.)
- This means that x2 and x3 have equal probability of appearing in the function. This might confuse the algorithm when it faces the question of whether x2 or x3 should be chosen.
- We tested the case with 2 examples:
- (a) One example has 8 variables, 64 terms and 6 literals in each term;
- (b) The other example has 10 variables, 64 terms and 6 literals in each term.
- Both cases performed well and produced the expected results.
48 Case 5
- Dr. Servedio recommended this case:
- (x1x2 ∨ x3x4)(x5x6 ∨ x7x8) ∨ (x9x10 ∨ x11x12)(x13x14 ∨ x15x16)
- which is an AND-OR tree, i.e. a binary tree where odd layers all have AND and even layers all have OR.
49 Case 5 Results
- If it is converted to a monotone DNF function, it is as follows:
- x1x2x5x6 ∨ x1x2x7x8 ∨ x3x4x5x6 ∨ x3x4x7x8 ∨ x9x10x13x14 ∨ x9x10x15x16 ∨ x11x12x13x14 ∨ x11x12x15x16
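This expansion is easy to machine-check; a small sketch (our own) confirms that the 8-term DNF above agrees with the AND-OR tree on all 2^16 assignments:

```python
from itertools import product

# Case 5 target as an AND-OR tree and as its expanded 8-term DNF.
def tree(x):
    return ((x[1] and x[2] or x[3] and x[4]) and
            (x[5] and x[6] or x[7] and x[8]) or
            (x[9] and x[10] or x[11] and x[12]) and
            (x[13] and x[14] or x[15] and x[16]))

dnf_terms = [{1, 2, 5, 6}, {1, 2, 7, 8}, {3, 4, 5, 6}, {3, 4, 7, 8},
             {9, 10, 13, 14}, {9, 10, 15, 16},
             {11, 12, 13, 14}, {11, 12, 15, 16}]

def dnf(x):
    return any(all(x[i] for i in t) for t in dnf_terms)

ok = all(bool(tree(dict(enumerate(bits, 1)))) == dnf(dict(enumerate(bits, 1)))
         for bits in product([0, 1], repeat=16))
print(ok)  # True: the two forms agree on all 65536 inputs
```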
50 Case 5 Results
- For this case the algorithm didn't finish with the expected results.
- The minimum error was 0.134216, first reached in loop 135, and it stayed at 0.134216 until loop 2681. The difference values are not getting large, staying around 1 and 2; for loop 2681, the difference is 1.28302.
- Since it spent so many loops without reaching a smaller error value, and some parity functions are not subsets of the terms in the target function, it appears that the algorithm will not work out for this case.
51 Case 5 Analysis
- Since a DNF expression can also be represented by a CNF, it is not guaranteed that the algorithm will find the terms of the DNF rather than the clauses of the CNF. It was expected that this case of monotone DNF might be hard to learn, since the algorithm may find parities which correspond to clauses of the CNF.
52 Case 6: Monotone DNF with dependent terms
- 1st term: x1x7x4
- 2nd term: xaxbxc
- 3rd term: xdxexf
- Suppose α = 0.3.
- In the 2nd term, a is 1 with probability 0.3; a is chosen randomly from the other remaining variables with probability 0.7.
- The same rule applies for b, c, d, e.
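A sketch of this generation rule (our own reading of the slide; function and variable names are assumptions): each literal position of a later term reuses the corresponding literal of the first term with probability α, and is otherwise drawn at random from the variables not already in the term:

```python
import random

def dependent_term(first_term, n, alpha, rng):
    """Build a term over variables 1..n: each position reuses the
    corresponding variable of first_term with probability alpha, and
    otherwise picks a random variable not already used in this term."""
    term = []
    for v in first_term:
        if rng.random() < alpha and v not in term:
            pick = v
        else:
            pick = rng.choice([u for u in range(1, n + 1) if u not in term])
        term.append(pick)
    return term

rng = random.Random(1)
first = [1, 7, 4]                 # the slide's 1st term x1x7x4
print(dependent_term(first, 10, 0.3, rng))
```

With α = 1 every later term copies the first term exactly; with α = 0 the terms are independent, matching the two extremes described on the next slides.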
53 Case 6 (Cont.)
- Assume there is a parameter α (0 ≤ α ≤ 1) which is used to indicate how dependent the terms in the target function are.
- The larger the value of α, the more dependent the terms are.
54 Case 6 (Cont.)
- For this case, the value of α can be varied continuously from 0 to 1:
- 1. When α = 0, all the terms are generated independently of each other;
- 2. When 0 < α < 1, the terms in the target function are dependent with coefficient α.
55 Why do we test this case?
- Since the value of α can be varied continuously from 0 to 1, we can get a series of functions with different degrees of dependence between terms.
- If this series of functions can be learned, it means that the algorithm works whether the terms are dependent or independent. This result would be very encouraging.
56 Note
- To get valuable test results, we need to choose the test cases carefully. We prefer to choose a target function f with a probability of about 0.5 of being 1 and a probability of about 0.5 of being -1. If a target function satisfies these probability requirements, it is balanced; otherwise, it is biased.
- Suppose the variable β is the probability that the target function is -1; then the more skewed the target function, the further the value of β is from 0.5.
- When the function is more balanced, it is harder for the algorithm to learn.
57 Case 6 Results
- We generated a series of test cases; each function has 16 variables, 20 terms and 4 literals in each term. And:
- (a) Case 1. When α = 0, β = 0.465515
- (b) Case 2. When α = 0.25, β = 0.462097
- (c) Case 3. When α = 0.50, β = 0.516663
- (d) Case 4. When α = 0.75, β = 0.648193
- (e) Case 5. When α = 0.95, β = 0.875.
58 Case 6 Results (Cont.)
- The algorithm performs very well in finding the hypothesis for Case 1, Case 4 and Case 5 with ε = 0.05.
- But it seems to have problems finding the hypothesis for Case 2 and Case 3.
59 Case 6 Results (Cont.)
- The following is the data we got in the results:
- (a) For α = 0.25, from loop 82 to loop 5451 the minimum error stays at 0.269104; the difference values keep getting larger, reaching 409.171 in loop 5451.
- (b) For α = 0.5, from loop 51 to loop 7046 the minimum error stays at 0.209656; the difference values keep getting larger, reaching 360.967 in loop 7046.
60 Case 6 Results: Another series of functions
- Another series of similar target functions: each function has 16 variables, 256 terms and 6 literals in each term.
- (a) Case 1. When α = 0, β = 0.45665
- (b) Case 2. When α = 0.1, β = 0.454147
- (c) Case 3. When α = 0.2, β = 0.457825
- (d) Case 4. When α = 0.3, β = 0.45816
- (e) Case 5. When α = 0.4, β = 0.460587
- (f) Case 6. When α = 0.5, β = 0.460587
- (g) Case 7. When α = 0.6, β = 0.472488
- (h) Case 8. When α = 0.7, β = 0.497726
- (i) Case 9. When α = 0.8, β = 0.520004
- (j) Case 10. When α = 0.85, β = 0.551468
- (k) Case 11. When α = 0.9, β = 0.62793
- (l) Case 12. When α = 0.95, β = 0.719345.
61 Case 6 Results: Another series of functions
- (a) Case 1. When α = 0, β = 0.45665. The algorithm reaches a minimum error of 0.144882 in loop 33 and keeps this minimum error until loop 5185. From the result file, we can see that no new parities are added to the hypothesis after loop 47. Without new parities being added, only the weights of the old parities are increased.
62 Case (a) (cont.)
- In loop 5185, the selected hypothesis is:
63 Case (a) (cont.)
- The difference value for this picked hypothesis is 99.7091.
- For this case, since α = 0, the terms are supposed to be independent of each other; that is, it is a random monotone DNF function. But given the above results, it seems the algorithm may fail to find the expected hypothesis. For now, we suppose that although we thought it was a random monotone DNF, the terms in the function might be somehow dependent on each other. We still need to do some analysis on this case.
64 Case (b) Results
- (b) Case 2. When α = 0.1, β = 0.454147. The minimum error is 0.129028 until loop 9409. Since some files were deleted, I cannot find in which loop this minimum error value was reached; it was known from my notes that it was already reached by loop 6866. In loop 8117 the picked hypothesis has difference value 38.94, and in loop 9409 it has difference value 44.5602. It is also known that after loop 320, no new parity was added to the hypothesis.
65 Cases (c)-(h) Results
66 Only Case (l) Finished
- When α = 0.95, β = 0.719345, the algorithm works well on this case. It reaches the minimum error 0.053772 in loop 2357 with difference value 2.09869.
67 Case 6 Analysis
- We are beginning to think that maybe the current algorithm does not work correctly for some monotone DNF functions with dependent terms.
- We will discuss later, in the section "Future Work", some changes to the current algorithm to make it work for these cases.
68 Future Work
- Change the coefficient of the constant parity?
- When there are ties in the pool:
- a. try choosing randomly from the ties;
- b. try choosing the hypothesis with the smallest weight after the parity is added to the hypothesis or the weight of the current parity is increased.
69 Summary
- 1. We have some cases, some of which are huge functions, on which the algorithm performs very well. These results are encouraging.
- 2. We have problems with the case recommended by Dr. Servedio and with monotone DNFs with dependent terms. We still need to make some changes to the algorithm to see if it will work on these cases.
70 Acknowledgements
- Dr. Jeffrey Jackson
- Dr. Donald Simon
- Dr. Frank D'Amico
- Dr. Kathleen Taylor
- Other professors in the department
71 Thanks for coming!