1
Toward Uniform-distribution Monotone DNF Learning: A Proposed Algorithm
  • Presented by Wenzhu Bi
  • Department of Mathematics and Computer Science
  • Duquesne University
  • Advisor: Dr. Jeffrey Jackson

2
Outline
  • Some definitions
  • Why we are working on DNF learning
  • The proposed problem
  • Previous work
  • Our work and the algorithm
  • Performed tests and analysis
  • Future work

3
Some Definitions
  • AND
  • The Boolean function is true if and only if all
    the literals are true. For example, p ∧ q is true
    if and only if p and q are both true.
  • OR
  • The Boolean function is false only if all the
    literals are false. For example, p ∨ q is false if
    and only if p and q are both false.
  • Negation
  • If q is true, then ¬q is false; if q is false,
    then ¬q is true.

4
More Definitions
  • DNF (Disjunctive Normal Form)
  • An OR of ANDs. Conjunctions in a DNF are called
    terms. The size of a DNF is defined as the
    number of terms that it has.
  • For example, x1x2x3 ∨ x4x5 is a 2-term DNF.
  • CNF (Conjunctive Normal Form)
  • An AND of ORs.
  • For example, (x1)(x2)(x3)(x4 ∨ x5 ∨ x6).
  • Monotone DNF
  • A DNF with no negated variables.
  • For example, x1x2x3 ∨ x4x5.

5
Why are we interested in DNF learning?
6
DNF is a natural means of representing many
"expert" rules

if it is raining OR the forecast calls for rain
AND you trust weather forecasts OR you like
to always be prepared AND you are taking a
briefcase AND your briefcase is not full, then
take an umbrella when you go out.
7
A more practical area: digital circuit design and
implementation
8
The Proposed Problem
  • In 1984, Valiant introduced the
    distribution-independent model of Probably
    Approximately Correct (PAC) learning from random
    examples and posed the problem of whether
    polynomial-size DNF functions are PAC learnable
    in polynomial time. For about twenty years, the
    DNF learning problem has been widely regarded as
    one of the most important and challenging open
    questions in Computational Learning Theory.

9
General DNF Learning
  • It has been shown by Dr. Jackson that general
    DNF is strongly learnable with respect to the
    uniform distribution using membership queries.

10
Previous Work
  • Because of the lack of progress on learning
    monotone DNF in the distribution-independent
    setting, instead of approaching that result
    directly, many researchers have studied
    restricted versions, such as learning monotone
    DNF from uniformly-distributed examples.
  • Notice that the problem is restricted in two
    ways:
  • 1. General DNF → Monotone DNF
  • 2. Arbitrary Distribution → Uniform

11
Previous Work(Cont.)
  • Hancock and Mansour gave a polynomial-time
    algorithm for learning monotone read-k DNF (DNF
    in which every variable appears at most k times,
    where k is a constant) under a family of
    distributions (including the uniform
    distribution).
  • A series of later results, by Verbeurgt, Kucera,
    Sakai, Maruoka, Bshouty, and Tamon, improved
    monotone DNF learning under the uniform
    distribution.
  • The best result to date is due to Servedio, who
    proved that the class of monotone
    2^(O(√log n))-term DNF formulae can be PAC
    learned in polynomial time under the uniform
    distribution from random examples only.

12
Our Work
  • Our algorithm attempts to learn a monotone DNF
    f using a hypothesis h (h is not necessarily a
    monotone DNF) in polynomial time from a set of
    uniformly-distributed examples labeled by f.

13
The Algorithm
  • Input
  • A target function f, where f is a monotone DNF
    function.
  • Output
  • A hypothesis h such that Pr[f ≠ h] ≤ ε.
  • In our particular case, h is a threshold
    function h = sign(F), where F is a sum of parity
    functions.

14
Definitions in the Algorithm
  • Threshold Function
  • h = sign(F)
  • h is 1 if F > 0, while h is -1 if F < 0.

15
More Definitions in the Algorithm
  • Parity function χa(x), for bit vectors a and x of
    length n (written in the order xn ... x3 x2 x1)
  • The function has value 1 when the number of 1s in
    x at the positions indexed by a is even, and -1
    when that number is odd. Equivalently,
    χa(x) = (-1)^(a·x).
16
The Hypothesis
  • h = sign(F)
  • in which F(x) = Σ_{a in S} F̂(a)·χa(x),
  • S is a polynomial-size set of bit vectors, and
  • each F̂(a) is an integer, called the Fourier
    coefficient of F.
  • (A small code sketch of this representation is
    given below.)
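The following is a minimal Python sketch of this representation, added for
this transcript; it is not the presentation's own code, and the names
(parity, Hypothesis) and the tuple encoding of bit vectors are illustrative
choices.

    # Bit vectors a and x are tuples of 0/1, written here in the slide's
    # order (xn ... x2 x1), so a = (0, 0, 1) indexes x1.
    def parity(a, x):
        """chi_a(x): +1 if the number of 1s of x in the positions indexed
        by a is even, -1 if that number is odd."""
        ones = sum(ai & xi for ai, xi in zip(a, x))
        return 1 if ones % 2 == 0 else -1

    class Hypothesis:
        """h(x) = sign(F(x)), where F is an integer-weighted sum of parities."""

        def __init__(self, weights):
            self.weights = dict(weights)   # maps index vector a -> integer weight

        def F(self, x):
            return sum(w * parity(a, x) for a, w in self.weights.items())

        def __call__(self, x):
            return 1 if self.F(x) > 0 else -1

    # Example from the walkthrough below: h = sign(1 - 2*chi_001) over 3 variables.
    h = Hypothesis({(0, 0, 0): 1, (0, 0, 1): -2})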

17
The Algorithm
  • 1. Initialize F = 1, so h = sign(F) = 1. Check
    whether Pr[f ≠ h] ≤ ε. If yes, the algorithm
    stops; otherwise, continue to the next step.
  • 2. Find the neighbors of the current parities, or
    add to the weights of existing parities, to
    produce the potential hypotheses.

18
How to choose the best hypothesis
  • If h = sign(F) is a perfect hypothesis (i.e.,
    h(x) = f(x) for all x), then E[f·F] = E[|F|],
    since f(x)·F(x) = |F(x)| whenever
    sign(F(x)) = f(x).
  • In the algorithm we build up the hypothesis by
    choosing a new parity in every loop. In each
    stage, we select the hypothesis whose new parity
    minimizes the difference between the two sides of
    the above equation.

19
The Algorithm Continued
  • Step 2 continued
  • Then pick up the hypothesis which minimizes the
    difference. Then check if Prf?h ?. If yes,
    the algorithm stops otherwise, repeat Step 2.
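A Python sketch of how the loop just described could be organized (my own
reading, not the presentation's code): everything is evaluated exactly over
the truth table as in the tests below, the candidate-generation step is
passed in as a parameter, and the difference value is taken to be
|E[f·F] - E[|F|]|. It reuses parity and Hypothesis from the earlier sketch.

    from itertools import product

    def error(f, h, n):
        """Exact Pr[f != h] under the uniform distribution on {0,1}^n."""
        inputs = list(product((0, 1), repeat=n))
        return sum(f(x) != h(x) for x in inputs) / len(inputs)

    def difference(f, weights, n):
        """|E[f*F] - E[|F|]|: zero when h = sign(F) agrees with f everywhere."""
        inputs = list(product((0, 1), repeat=n))
        F = lambda x: sum(w * parity(a, x) for a, w in weights.items())
        e_fF = sum(f(x) * F(x) for x in inputs) / len(inputs)
        e_absF = sum(abs(F(x)) for x in inputs) / len(inputs)
        return abs(e_fF - e_absF)

    def learn(f, n, eps, candidates):
        """Greedy loop: start from F = 1 and repeatedly pick the candidate
        weight assignment (new parity, or bumped weight) with the smallest
        difference value, until Pr[f != h] <= eps."""
        weights = {(0,) * n: 1}                     # Step 1: F = 1
        h = Hypothesis(weights)
        while error(f, h, n) > eps:
            pool = candidates(weights, n)           # Step 2: potential hypotheses
            weights = min(pool, key=lambda w: difference(f, w, n))
            h = Hypothesis(weights)
        return h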

20
An example
  • f = x1x2x3

21
FFT (Butterfly): Fast Fourier Transform
22
  • Each Fourier coefficient requires time O(2^n)
    to compute exactly. We can compute all of the
    coefficients in O(n·2^n) time using the FFT.
  • In our tests, we perform this exact
    computation so that we can leave out the
    variability added by estimating the Fourier
    coefficients. In an actual implementation, these
    coefficients would be estimated using a
    polynomial-size set of examples (sufficient for
    PAC learning by the Chernoff bound). (A sketch of
    the butterfly computation follows below.)
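A short Python sketch of the butterfly (the fast Walsh-Hadamard transform)
on a full truth table, added for illustration; the name and the list-based
input are my own. It assumes the truth table holds f(x) in +1/-1 form with
x read as an n-bit integer, and returns all 2^n coefficients E[f·χa] in
O(n·2^n) time.

    def all_fourier_coefficients(truth_table):
        """In-place Walsh-Hadamard butterfly over a table of length 2^n."""
        coeffs = list(truth_table)
        size = len(coeffs)
        half = 1
        while half < size:
            for start in range(0, size, 2 * half):
                for i in range(start, start + half):
                    a, b = coeffs[i], coeffs[i + half]
                    coeffs[i], coeffs[i + half] = a + b, a - b
            half *= 2
        return [c / size for c in coeffs]   # divide by 2^n to get expectations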

23
Step 1: F = 1
  • Then h = sign(F) = sign(1) = 1.
  • And S = {000}; the weight of the constant parity
    is 1, and the Fourier coefficient of f on the
    constant parity is -0.75.
  • Check all the examples over the truth table:
    Pr[f ≠ h] = 0.875,
    which is greater than ε (suppose ε is 0.05).
    So we continue to Step 2. (A short check of these
    numbers follows below.)
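A quick check of the two numbers above (my own code, using the +1/-1
encoding in which f(x) = +1 exactly when the term x1x2x3 is satisfied):

    from itertools import product

    f = lambda x: 1 if all(x) else -1           # f = x1x2x3 in +/-1 form
    inputs = list(product((0, 1), repeat=3))

    # h = sign(1) is constantly +1, so it disagrees with f on the 7 inputs
    # where the term is not satisfied: 7/8 = 0.875.
    print(sum(f(x) != 1 for x in inputs) / 8)   # 0.875

    # Fourier coefficient of f on the constant parity: E[f] = (1 - 7)/8 = -0.75.
    print(sum(f(x) for x in inputs) / 8)        # -0.75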

24
Step 2, loop 1
  • Current hypothesis h = sign(F) with F = 1 and
    S = {000}.
  • The neighbors of 000 are 001, 010, and 100.
  • So the potential hypotheses will be
  • F = 3
  • F = 1 - 2χ001(x)
  • F = 1 - 2χ010(x)
  • F = 1 - 2χ100(x)

25
The hypothesis pool
26
Choose the hypothesis h = sign(F) = sign(1 - 2χ001(x))
  • With S = {000, 001}.
  • Check all the examples over the truth table:
    Pr[f ≠ h] = 0.375,
    which is greater than ε (suppose ε is 0.05).
    So we repeat Step 2.

27
Step 2, loop 2
  • Current hypothesis
    h = sign(F) = sign(1 - 2χ001(x)) with S = {000, 001}.
  • The neighbors of 000 are 001, 010, and 100;
    the neighbors of 001 are 000, 011, and 101.
  • Since 000 and 001 are already in the current
    hypothesis, the new neighbors pool will not
    include these two parities.
  • So the neighbors pool is temporarily
    {010, 100, 011, 101}.

28
Step 2, loop 2 (cont.)
  • Here we add another rule: for any parities a and
    a′ in F (other than the constant parity), if a′
    is a subset of a, then the weight of χa′ must be
    greater than or equal to the weight of χa.
  • It also means that a new parity can be added to
    the pool only if all of its immediate
    down-neighbors are already present. (A sketch of
    this pool-building rule follows below.)
  • So the new neighbors pool will be {010, 100}, and
    the potential hypotheses will be
  • F = 3 - 2χ001(x)
  • F = 1 - 4χ001(x)
  • F = 1 - 2χ001(x) - 2χ010(x)
  • F = 1 - 2χ001(x) - 2χ100(x)
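A Python sketch of how this candidate pool could be built, under my reading
of the rule (the function name is mine): flip one bit of each parity in S,
drop vectors already in S, and keep only those whose immediate
down-neighbors are all present.

    def candidate_parities(S, n):
        """Immediate neighbors of the parities in S, filtered by the rule
        that every immediate down-neighbor (clear one set bit) is in S."""
        S = set(S)
        neighbors = set()
        for a in S:
            for i in range(n):
                b = list(a)
                b[i] ^= 1
                neighbors.add(tuple(b))
        pool = []
        for b in neighbors - S:
            downs = [b[:i] + (0,) + b[i + 1:] for i in range(n) if b[i] == 1]
            if all(d in S for d in downs):
                pool.append(b)
        return pool

    # Loop 2 of the example: S = {000, 001} gives the pool {010, 100};
    # 011 and 101 are excluded because 010 and 100 are not yet in S.
    print(sorted(candidate_parities({(0, 0, 0), (0, 0, 1)}, 3)))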

29
The hypothesis pool
30
Choose the hypothesis h = sign(F) = sign(1 - 2χ001(x) - 2χ010(x))
  • With S = {000, 001, 010}.
  • Check all the examples over the truth table:
    Pr[f ≠ h] = 0.375,
    which is greater than ε (suppose ε is 0.05).
    So we repeat Step 2.

31
Step 2, loop 3
  • Current hypothesis
    h = sign(F) = sign(1 - 2χ001(x) - 2χ010(x)) with
    S = {000, 001, 010}.
  • The neighbors of 000 are 001, 010, and 100;
    the neighbors of 001 are 000, 011, and 101;
    the neighbors of 010 are 000, 011, and 110.
  • Since 000, 001, and 010 are already in the
    current hypothesis, the new neighbors pool will
    not include these three parities.
  • So the neighbors pool is temporarily
    {100, 011, 101, 110}.

32
Step 2, loop 3 (cont.)
  • By the subset rule, the new neighbors pool will
    be {100, 011}. Then the potential hypotheses
    will be
  • F = 3 - 2χ001(x) - 2χ010(x)
  • F = 1 - 4χ001(x) - 2χ010(x)
  • F = 1 - 2χ001(x) - 4χ010(x)
  • F = 1 - 2χ001(x) - 2χ010(x) + 2χ100(x)
  • F = 1 - 2χ001(x) - 2χ010(x) + 2χ011(x)

33
The hypothesis pool
34
Choose the hypothesis
h = sign(1 - 2χ001(x) - 2χ010(x) + 2χ011(x))
  • With S = {000, 001, 010, 011}.
  • Check all the examples over the truth table:
    Pr[f ≠ h] = 0.125,
    which is greater than ε (suppose ε is 0.05).
    So we repeat Step 2.

35
Then repeat Step 2
  • Finally, in Loop 7, we find the hypothesis
  • h = sign(1 - 2χ001(x) - 2χ010(x) + 2χ011(x)
    - 2χ100(x) + 2χ101(x) + 2χ110(x) - 2χ111(x)).
  • Check all the examples over the truth table:
    Pr[f ≠ h] = 0,
    which is less than ε (suppose ε is 0.05).
  • So the algorithm ends, having found the final
    best hypothesis h.

36
Performed Tests and Analysis
37
How to analyze the results
  • 1. Each parity in the hypothesis should be a
    subset of some term in the target function. If
    some parity is not a subset of any term in the
    target function, this is a sign that the
    algorithm is breaking down; otherwise, it is
    working as we expect. (A small check implementing
    this criterion follows below.)
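A small Python helper for this check, assuming the hypothesis parities and
the target's terms are both represented as sets of variable indices (the
representation and names are mine):

    def inconsistent_parities(hypothesis_parities, terms):
        """Return the non-constant parities that are not a subset of any
        term of the target DNF; a non-empty result signals trouble."""
        return [a for a in hypothesis_parities
                if a and not any(a <= t for t in terms)]

    # Example: target f = x1x2 v x3x4.
    terms = [frozenset({1, 2}), frozenset({3, 4})]
    print(inconsistent_parities(
        [frozenset(), frozenset({1}), frozenset({1, 3})], terms))
    # -> [frozenset({1, 3})]: not contained in any single term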

38
How to analyze the results(cont.)
  • 2. Check whether the error values keep
    decreasing.
  • The error values may bounce up and down, but the
    minimum error value is expected to keep
    decreasing. When the minimum error is smaller
    than ε, the algorithm stops, having found the
    final best hypothesis.

39
Case 1 Simple Test Cases
  • f = x1x2x3
  • f = x1x2 ∨ x3
  • f = x1x2 ∨ x3x4
  • f = x1x2 ∨ x3x5 ∨ x4x1 ∨ x2x3
  • f = x1x2x3x4 ∨ x5x6x7x8
  • f = x1x2x3 ∨ x4x5x6 ∨ x7x8x9
  • f = x1x2x3 ∨ x4x5x6 ∨ x7x8x9 ∨ x10x11x12
  • f = x1x2x3 ∨ x4x5x6 ∨ x7x8x9 ∨ x10x11x12
    ∨ x13x14x15
  • f = x1x2x3x4 ∨ x6x5x7x8 ∨ x9x10x11x12
    ∨ x13x14x15x16
  • f = x15x2x3x4 ∨ x6x5x7x8 ∨ x9x10x11x12
    ∨ x13x14x1x16
  • f = x1x2x3x4 ∨ x5x6x7x8 ∨ x9x10x11x12 ∨ x13x14x15
    ∨ x16x17x18
  • f = x1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19
  • f = x1x2x3x4x5x6x7x8x9x10x11x12x13x14x15x16x17x18x19x20

40
f = x1x2x3x4 ∨ x6x5x7x8 ∨ x9x10x11x12 ∨ x13x14x15x16
41
Case 2 Read-once monotone DNF
  • A read-once monotone DNF here has u terms, each
    with log(u) distinct variables, and no variable
    appears in more than one term.
  • We tested
  • x1x2 ∨ x3x4 ∨ x5x7 ∨ x6x8
  • The algorithm found the best hypothesis easily.

42
Case 3 Random monotone DNF
  • Drs. Jackson and Servedio recently proved that
    n-term random monotone DNF with log(n) literals
    in each term is learnable in polynomial time from
    uniform random examples with high probability.
  • We tested n²-term random monotone DNF with
    2·log(n) literals in each term, because Jackson
    and Servedio's result does not extend to this
    case. We tried two cases, with n = 8 and n = 16.

43
Case 3 Results
  • When n = 8, f has 8 variables, 64 terms, and 6
    literals in each term. We tested with the
    following function:
  • x1x6x3x5x7x2 ∨ x8x6x7x4x1x5 ∨ x4x2x8x6x5x7 ∨ x6x4x5x3x7x1
    ∨ x4x1x2x3x5x8 ∨ x8x3x5x6x7x2 ∨ x1x3x6x5x2x8 ∨ x1x4x8x6x7x3
    ∨ x4x7x5x2x6x1 ∨ x3x2x6x1x8x5 ∨ x8x2x3x5x1x4 ∨ x2x8x7x4x3x1
    ∨ x4x2x6x5x8x1 ∨ x6x1x8x7x2x3 ∨ x1x4x6x3x8x7 ∨ x2x1x4x3x7x5
    ∨ x8x3x5x2x7x4 ∨ x1x5x3x4x7x8 ∨ x3x1x4x5x8x2 ∨ x1x3x4x2x7x5
    ∨ x3x6x5x4x8x2 ∨ x3x5x2x6x1x7 ∨ x3x7x5x2x6x8 ∨ x5x6x7x3x2x8
    ∨ x2x6x3x5x8x1 ∨ x1x4x5x2x3x8 ∨ x1x7x4x2x6x3 ∨ x2x6x1x4x8x7
    ∨ x8x7x3x5x4x1 ∨ x5x4x2x1x3x8 ∨ x1x8x6x7x2x5 ∨ x6x3x2x5x8x1
    ∨ x7x1x3x8x6x5 ∨ x8x3x6x5x2x1 ∨ x3x6x7x8x4x1 ∨ x7x5x1x2x3x4
    ∨ x8x3x1x5x6x4 ∨ x8x5x1x4x7x3 ∨ x5x7x2x4x6x8 ∨ x3x4x5x8x2x7
    ∨ x2x4x6x3x5x7 ∨ x5x6x8x3x4x7 ∨ x4x7x2x8x3x5 ∨ x5x8x1x2x4x6
    ∨ x4x8x5x3x6x1 ∨ x6x5x2x8x3x4 ∨ x8x5x2x3x4x7 ∨ x6x7x3x4x5x1
    ∨ x5x4x1x2x7x8 ∨ x8x5x1x6x3x2 ∨ x5x6x3x1x4x8 ∨ x4x6x1x3x5x7
    ∨ x4x1x6x2x5x8 ∨ x8x4x1x6x5x3 ∨ x8x3x4x5x7x6 ∨ x2x4x8x3x1x5
    ∨ x4x8x7x2x1x3 ∨ x4x2x3x8x6x7 ∨ x3x7x5x4x2x6 ∨ x3x2x6x1x7x4
    ∨ x3x7x4x1x5x6 ∨ x2x6x7x5x8x4 ∨ x4x5x3x2x8x6 ∨ x6x8x2x7x5x3

44
The hypothesis h = sign(F)
45
Case 3 Results
  • When n = 16, f has 16 variables, 256 terms, and
    8 literals in each term. We tested one function
    of this form.
  • This case was not finished. The result so far is
    that the minimum error is 0.111252, reached in
    Loop 2349 with a difference value of 6.38239.
    Before that, the minimum error stayed at 0.112366
    from Loop 1363 to Loop 2238, then changed to
    0.112183 and then to 0.111252. Since the
    difference value is still small and the minimum
    error is still getting smaller, it seems
    reasonable to expect that this case would finish
    with the expected results. (This case had run for
    several days and was then stopped because the
    computer was restarted by other people. We will
    test this case again if we have time.)

46
Case 4 Monotone DNF with significant sharing
of variables between terms
  • We have tested a case like this: whenever x1 is
    selected for a term, x2 and x3 will also be in
    the term with probability 0.5 each.
  • That is, whenever x1 is selected for a term:
  • with probability 0.25, both x2 and x3 are in
    the term;
  • with probability 0.25, only x2 is in the term;
  • with probability 0.25, only x3 is in the term;
  • with probability 0.25, neither x2 nor x3 is in
    the term.
  • (A small sketch of this sharing rule follows
    below.)
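A tiny Python sketch of the sharing rule above (the name and the set
representation are mine; how x1 itself gets chosen, and how the rest of the
term is filled in, is left to the surrounding term generator):

    import random

    def share_with_x1(term):
        """Given that x1 is already in the term, add x2 and x3 independently
        with probability 0.5 each, so the four outcomes (both, only x2,
        only x3, neither) each occur with probability 0.25."""
        term = set(term) | {1}
        if random.random() < 0.5:
            term.add(2)
        if random.random() < 0.5:
            term.add(3)
        return frozenset(term)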

47
Case 4 Cont.
  • This means that x2 and x3 appear in the function
    with equal probability, which might confuse the
    algorithm when it has to decide whether x2 or x3
    should be chosen.
  • We tested the case with two examples:
  • (a) one example has 8 variables, 64 terms, and 6
    literals in each term;
  • (b) the other example has 10 variables, 64 terms,
    and 6 literals in each term.
  • Both cases performed well and gave the expected
    results.

48
Case 5
  • Dr. Servedio recommended this case:
  • (x1x2 ∨ x3x4)(x5x6 ∨ x7x8)
    ∨ (x9x10 ∨ x11x12)(x13x14 ∨ x15x16),
  • which is an AND-OR tree, i.e., a binary tree
    where odd layers all have ANDs and even layers
    all have ORs.

49
Case 5 Results
  • Converted to a monotone DNF function, it is the
    following:
  • x1x2x5x6 ∨ x1x2x7x8 ∨ x3x4x5x6 ∨ x3x4x7x8
    ∨ x9x10x13x14 ∨ x9x10x15x16 ∨ x11x12x13x14
    ∨ x11x12x15x16
  • (A quick truth-table check of this conversion is
    sketched below.)
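A quick Python truth-table check (mine, not part of the presentation) that
the 8-term DNF above agrees with the AND-OR tree on every input:

    from itertools import product

    def tree(x):
        """The Case 5 AND-OR tree; x[i] holds the value of x_{i+1}."""
        return ((x[0] and x[1] or x[2] and x[3]) and (x[4] and x[5] or x[6] and x[7])
                or (x[8] and x[9] or x[10] and x[11]) and (x[12] and x[13] or x[14] and x[15]))

    terms = [(1, 2, 5, 6), (1, 2, 7, 8), (3, 4, 5, 6), (3, 4, 7, 8),
             (9, 10, 13, 14), (9, 10, 15, 16), (11, 12, 13, 14), (11, 12, 15, 16)]

    def dnf(x):
        return any(all(x[i - 1] for i in t) for t in terms)

    assert all(bool(tree(x)) == dnf(x) for x in product((0, 1), repeat=16))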

50
Case 5 Results
  • For this case the algorithm didn't finish with
    the expected results.
  • The minimum error was 0.134216, first reached in
    Loop 135, and it stayed at 0.134216 until Loop
    2681. The difference values are not getting
    large, staying around 1 to 2; for Loop 2681 the
    difference is 1.28302.
  • Since it took so many loops without reaching a
    smaller error value, and some parity functions
    are not subsets of any term in the target
    function, it appears that the algorithm will not
    work out for this case.

51
Case 5 Analysis
  • Since a DNF expression can also be represented
    by a CNF, it is not guaranteed that the algorithm
    will find the terms of the DNF rather than the
    clauses of the CNF. It was expected that this
    kind of monotone DNF might be hard to learn,
    since the algorithm may find parities that
    correspond to clauses of the CNF.

52
Case 6 Monotone DNF with dependent terms
  • 1st term: x1x7x4
  • 2nd term: xaxbxc
  • 3rd term: xdxexf
  • Suppose α = 0.3.
  • In the 2nd term, a is 1 with probability 0.3;
    otherwise a is chosen randomly from the other
    remaining variables (with probability 0.7).
  • The same rule applies to b, c, d, and e. (A
    sketch of this generator follows below.)
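A Python sketch of one reading of this generator (the positional-copy
interpretation and the names are my own assumptions): each position of a
later term copies the corresponding variable of the first term with
probability alpha, and otherwise draws a fresh variable at random.

    import random

    def dependent_term(base_term, n, alpha):
        """Build one term that reuses base_term's variables with prob alpha."""
        term = []
        for v in base_term:
            if random.random() < alpha and v not in term:
                term.append(v)                       # copy the base variable
            else:
                remaining = [i for i in range(1, n + 1)
                             if i not in term and i != v]
                term.append(random.choice(remaining))
        return term

    # First term is x1x7x4; later terms reuse its variables with prob alpha.
    print(dependent_term([1, 7, 4], n=16, alpha=0.3))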

53
Case 6 Cont.
  • Assume there is a parameter α (0 ≤ α ≤ 1), which
    indicates how dependent the terms in the target
    function are.
  • The larger the value of α, the more dependent
    the terms.

54
Case 6 Cont.
  • For this case, the value of α can be varied
    continuously from 0 to 1.
  • 1. When α = 0, all the terms are generated
    independently of each other.
  • 2. When 0 < α < 1, the terms in the target
    function are dependent with coefficient α.

55
Why do we test this case?
  • Since the value of α can be varied continuously
    from 0 to 1, we can get a series of functions
    with different degrees of dependence between
    terms.
  • If this series of functions can be learned, it
    means that the algorithm works whether the terms
    are dependent or independent. That result would
    be very encouraging.

56
Note
  • To get valuable test results, we need to choose
    the test cases carefully. We prefer target
    functions f that are 1 with probability about 0.5
    and -1 with probability about 0.5. If a target
    function satisfies this requirement, it is
    balanced; otherwise, it is biased.
  • Suppose β is the probability that the target
    function is -1; then the more skewed the target
    function, the farther β is from 0.5.
  • The more balanced the function, the harder it is
    for the algorithm to learn.

57
Case 6 Results
  • We then generated a series of test cases. Each
    function has 16 variables, 20 terms, and 4
    literals in each term:
  • (a) Case 1: α = 0, β = 0.465515
  • (b) Case 2: α = 0.25, β = 0.462097
  • (c) Case 3: α = 0.50, β = 0.516663
  • (d) Case 4: α = 0.75, β = 0.648193
  • (e) Case 5: α = 0.95, β = 0.875

58
Case 6 Results(Cont.)
  • The algorithm performs very well in finding the
    hypothesis for Case 1, Case 4, and Case 5 with
    ε = 0.05.
  • But it seems to have problems finding the
    hypothesis for Case 2 and Case 3.

59
Case 6 Results(Cont.)
  • The following is the data we got in the results:
  • (a) For α = 0.25, from Loop 82 to Loop 5451, the
    minimum error stays at 0.269104; the difference
    values keep getting larger, reaching 409.171 in
    Loop 5451.
  • (b) For α = 0.5, from Loop 51 to Loop 7046, the
    minimum error stays at 0.209656; the difference
    values keep getting larger, reaching 360.967 in
    Loop 7046.

60
Case 6 Results Another series of functions
  • Another series of similar target functions: each
    function has 16 variables, 256 terms, and 6
    literals in each term.
  • (a) Case 1: α = 0, β = 0.45665
  • (b) Case 2: α = 0.1, β = 0.454147
  • (c) Case 3: α = 0.2, β = 0.457825
  • (d) Case 4: α = 0.3, β = 0.45816
  • (e) Case 5: α = 0.4, β = 0.460587
  • (f) Case 6: α = 0.5, β = 0.460587
  • (g) Case 7: α = 0.6, β = 0.472488
  • (h) Case 8: α = 0.7, β = 0.497726
  • (i) Case 9: α = 0.8, β = 0.520004
  • (j) Case 10: α = 0.85, β = 0.551468
  • (k) Case 11: α = 0.9, β = 0.62793
  • (l) Case 12: α = 0.95, β = 0.719345

61
Case 6 Results Another series of functions
  • (a) Case 1: α = 0, β = 0.45665. The algorithm
    reaches a minimum error of 0.144882 in Loop 33
    and keeps this minimum error until Loop 5185.
    From the result file, we can see that no new
    parities are added to the hypothesis after Loop
    47; without new parities being added, only the
    weights of the existing parities keep being
    added to.

62
Case a (cont.)
  • In Loop 5185, the selected hypothesis is

63
Case a (cont.)
  • The difference value for this picked hypothesis
    is 99.7091.
  • For this case, since α = 0, the terms are
    supposed to be independent of each other; that
    is, it should be a random monotone DNF function.
    But given the above results, it seems the
    algorithm may fail to find the expected
    hypothesis. For now, we suppose that, although we
    intended a random monotone DNF, the terms in the
    function might somehow be dependent on each
    other. We still need to do some analysis on this
    case.

64
Case b Results
  • (b) Case 2: α = 0.1, β = 0.454147. The minimum
    error is 0.129028 up to Loop 9409. Since some
    files were deleted, I cannot find in which loop
    this minimum error value was first reached, but
    from my notes it had already been reached by Loop
    6866. In Loop 8117, the picked hypothesis has a
    difference value of 38.94, and in Loop 9409 it
    has a difference value of 44.5602. It is also
    known that after Loop 320, no new parity was
    added to the hypothesis.

65
Case c-h Results
  • Similar to Case b.

66
Only Case l - Finished
  • When α = 0.95, β = 0.719345, the algorithm works
    well. It reaches the minimum error of 0.053772 in
    Loop 2357, with a difference value of 2.09869.

67
Case 6 Analysis
  • We are beginning to think that the current
    algorithm may not work correctly for some
    monotone DNF functions with dependent terms.
  • In the Future Work section we discuss some
    changes to the current algorithm intended to make
    it work for these cases.

68
Future Work
  • Change the coefficient of the constant parity?
  • When there are ties in the pool:
  • a. try choosing randomly among the ties;
  • b. try choosing the hypothesis with the smallest
    weight after the new parity is added to the
    hypothesis, or after the weight of the current
    parity is increased.

69
Summary
  • 1. We have a number of cases, some involving
    huge functions, on which the algorithm performs
    very well. These results are encouraging.
  • 2. We have problems with the case recommended by
    Dr. Servedio and with monotone DNFs that have
    dependent terms. We still need to make some
    changes to the algorithm to see whether it will
    work on these cases.

70
Acknowledgements
  • Dr. Jeffrey Jackson
  • Dr. Donald Simon
  • Dr. Frank D'Amico
  • Dr. Kathleen Taylor
  • Other professors in the department

71
Thanks for coming!
  • 07/28/2004