Learning to Construct Knowledge Bases from the World Wide Web by Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, Sean Slattery

1
Learning to Construct Knowledge Bases from the
World Wide Webby Mark Craven, Dan DiPasquo,
Dayne Freitag, Andrew McCallum, Tom Mitchell,
Kamal Nigam, Sean Slattery
Igor Yakymenko iy_at_cse.buffalo.edu Department of
Computer Science and Engineering SUNY at Buffalo
2
Fig. 1. An overview of the WebKB system
3
  • Two of the entities automatically extracted from
    the CMU computer science department Web site
    after training on four other university computer
    science sites. These entities were added to the
    knowledge base as new instances of faculty and
    project, extracted from Web hypertext.

4
Part II
  • Pages 11-29
  • Appendix B
  • Appendix C

5
Learning to Recognize Class Instances
  • Task
  • to identify new instances of ontology classes
    from text sources on the Web.
  • Discussion
  • A statistical bag-of-words approach to
    classifying Web pages, used with three different
    representations of pages.
  • Learning first-order rules to classify Web pages.
  • Evaluation of the effectiveness of combining the
    predictions made by all four of these classifiers.

6
Naive Bayes
  • 2 common approaches
  • multi-variate Bernoulli model (binary word
    count)
  • multinomial model (integer word count).

Given a set of classes C = {c1, ..., cN} and a
document consisting of n words (w1, w2, ..., wn), we
classify the document as a member of the class c
that is most probable given the words in the
document:
  c* = argmax_c Pr(c | w1, ..., wn)
Transform Pr(c | w1, ..., wn) by applying Bayes'
rule, rewrite the expression using the product
rule and drop the denominator, then assume that
the words are independent of each other given the
class:
  Pr(c | w1, ..., wn) ∝ Pr(c) ∏i Pr(wi | c)
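A minimal sketch of this decision rule in Python (illustrative only, not the
authors' code): the priors Pr(c) and the smoothed word probabilities Pr(w|c)
are assumed to have already been estimated from labeled training pages.

import math

def classify(doc_words, priors, word_probs):
    # doc_words: list of tokens in the document
    # priors: {class: Pr(c)}
    # word_probs: {class: {word: Pr(w|c)}}, smoothed so unseen words get a small value
    def log_score(c):
        probs = word_probs[c]
        return math.log(priors[c]) + sum(math.log(probs.get(w, 1e-10)) for w in doc_words)
    return max(priors, key=log_score)   # argmax over classes, computed in log space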
7
Naive Bayes Classifier Limitations
  1. Naive Bayes is not well suited to estimating a
    level of confidence for each class.
  2. The winning class tends to have a posterior
    probability of 1 (an artifact of the naive
    independence assumption).
  3. The losing classes tend to have posterior
    probabilities close to 0.

The authors' proposal is to modify the existing
formulas to overcome these limitations.
8
Modifications to Naive Bayes
Goal: scores that accurately reflect the
uncertainty in each prediction and make it
possible to sensibly compare the scores of
multiple documents (a smooth measure of
confidence).
Begin with naive Bayes, rewrite the expression as
an equivalent one over all words in the
vocabulary T instead of just the words in the
document (B.1), take the log (B.2), and divide by
the number of words in the document (B.3).
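The equations themselves were slide images; the following is a reconstruction
of the three steps from the description above, writing N(wi,d) for the number
of times word wi occurs in document d:

  (B.1)  Pr(c) ∏_{wi ∈ T} Pr(wi|c)^N(wi,d)
  (B.2)  log Pr(c) + Σ_{wi ∈ T} N(wi,d) log Pr(wi|c)
  (B.3)  (1/n) log Pr(c) + Σ_{wi ∈ T} (N(wi,d)/n) log Pr(wi|c)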
9
Modifications to Naive Bayes (continued)
By substituting Pr(wi|d) = N(wi,d)/n, the authors
derive the following expression for each class:
  (1/n) log Pr(c) + Σ_{wi ∈ T} Pr(wi|d) log Pr(wi|c)
where
  n is the number of words in the document,
  Pr(c) is the prior probability of the class,
  Pr(wi|d) is the probability (frequency) of word wi in document d,
  T is the whole vocabulary, and
  Pr(wi|c) is the probability (frequency) of word wi in class c.
10
Modifications to Naive Bayes (continued)
Subtracting the optimal encoding for a given
document, Σ_{wi ∈ T} Pr(wi|d) log Pr(wi|d), gives
the final formula for the score; the biggest
score determines the class assigned to a specific
document:
  Score_c(d) = (1/n) log Pr(c) + Σ_{wi ∈ T} Pr(wi|d) log ( Pr(wi|c) / Pr(wi|d) )
The summation on the right-hand side is a
negative relative entropy (KL divergence), a
measure of how different two probability
distributions are: the average number of bits
wasted by encoding events from a distribution p
with a code based on a not-quite-right
distribution q.
11
Naive Bayes Classifier (conclusion)
Approach: build a probabilistic model of each
class using labeled training data, and then
classify new pages by selecting the most probable
class. Given a document d to classify, we
calculate a score for each class c; the class
predicted by the method is the one with the
greatest score (a small executable sketch follows
the definitions below):
  Score_c(d) = (1/n) log Pr(c) + Σ_{wi ∈ T} Pr(wi|d) log ( Pr(wi|c) / Pr(wi|d) )
  • Pr(wi|c) is the probability of word wi given class c
  • Pr(wi|d) is the proportion of word wi in document d
  • n is the number of words in d
  • T is the vocabulary
  • wi is the ith word in the vocabulary
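A small executable sketch of this score (Python, illustrative only). Because
Pr(wi|d) is zero for words not in the document, the sum over the vocabulary T
reduces to a sum over the words of d; a small floor value stands in for proper
smoothing of Pr(wi|c).

import math
from collections import Counter

def score(doc_words, prior_c, word_probs_c):
    # doc_words: tokens of document d (assumed non-empty); prior_c: Pr(c)
    # word_probs_c: {word: Pr(w|c)} estimated from training pages
    n = len(doc_words)
    counts = Counter(doc_words)
    s = math.log(prior_c) / n
    for w, k in counts.items():
        p_wd = k / n                           # Pr(w|d): the word's frequency in d
        p_wc = word_probs_c.get(w, 1e-10)      # Pr(w|c), with a small floor for unseen words
        s += p_wd * math.log(p_wc / p_wd)      # negative relative-entropy term
    return s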

12
Experimental Evaluation
13
Experimental Evaluation
To obtain insight into the learned classifiers,
the authors ask which words contribute most
highly to the quantity Score_c(d) for each class.
Most of the highly weighted words are intuitively
prototypical for their class.
Many words that are conventionally placed on a
stop list are highly weighted by the model and
were therefore kept in the vocabulary.
14
(No Transcript)
15
Coverage: the percentage of pages of a given
class that are correctly classified.
Accuracy: the percentage of pages classified into
a given class that are actually members of that
class.
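In modern terms these correspond to per-class recall and precision. A small
illustrative function over (true class, predicted class) pairs:

def coverage_and_accuracy(pairs, cls):
    # pairs: iterable of (true_class, predicted_class)
    pairs = list(pairs)
    true_cls = [p for t, p in pairs if t == cls]   # pages that really belong to cls
    pred_cls = [t for t, p in pairs if p == cls]   # pages classified into cls
    coverage = sum(p == cls for p in true_cls) / len(true_cls) if true_cls else 0.0
    accuracy = sum(t == cls for t in pred_cls) / len(pred_cls) if pred_cls else 0.0
    return coverage, accuracy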
16
(No Transcript)
17
(No Transcript)
18
First-Order Text Classification: Quinlan's FOIL
Algorithm, Introduction
Two families of first-order learning systems:
1. Successive refinement methods: a faulty theory
is too general if it covers negative examples,
and too specific if it does not cover all
positive examples. The theory is revised until
all positive examples, and no negative examples,
are covered.
2. Separate-and-conquer strategy (greedy
algorithms): all examples are considered
together, and at each iteration a new element (a
literal) is added to the current clause so that
it covers some positive examples but no negative
examples.
J.R. Quinlan and R.M. Cameron-Jones, "Induction
of Logic Programs: FOIL and Related Systems"
19
Description of FOIL
As an example task, consider learning a
definition of the membership relation on lists
from a small world containing just the lists [],
[1], [2], [3], [1,2], [2,3], and [1,2,3]. The
target relation member(E,L) contains pairs whose
first constant denotes an element that belongs to
the list denoted by the second. In this small
world there are just ten elements in member:
  <1,[1]> <2,[2]> <3,[3]>
  <1,[1,2]> <2,[1,2]> <2,[2,3]>
  <3,[2,3]> <1,[1,2,3]> <2,[1,2,3]>
  <3,[1,2,3]>
As far as FOIL is concerned, lists like [1,2,3]
are just constants, so a background relation
components(L,H,T) is required to show how to find
the head H and tail T of a list L. The elements
making up components are:
  <[1],1,[]> <[2],2,[]> <[3],3,[]> <[1,2],1,[2]>
  <[2,3],2,[3]> <[1,2,3],1,[2,3]>
where the first states that list [1] has head 1
and tail [].
20
Description of FOIL (continued)
/* Top-down approach */
Initialization:
  theory := null program          /* learning concept member(E,L) */
  remaining := all positive elements of the target
               relation R         /* <1,[1]>, <2,[2]>, ... */
Iteration:
  While remaining is not empty    /* some positive examples are not yet covered */
    clause := R(A,B) :-
    While clause covers negative elements
      Find an appropriate literal L (a background relation)
      Add L to the right-hand side of the clause
    Remove the positive elements covered by the clause from remaining
    Add the clause to theory
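A runnable, heavily simplified sketch of this covering loop in Python.
Literals here are plain boolean predicates over examples, so this is a
propositional approximation; the real FOIL algorithm chooses literals with
variables using an information-gain heuristic. The selection rule below
(keep the most positives while shedding negatives) is an illustrative
stand-in, not the paper's heuristic.

def learn_theory(positives, negatives, candidate_literals):
    theory = []
    remaining = set(positives)
    while remaining:
        body, pos, neg = [], set(remaining), set(negatives)
        while neg:
            # greedily pick the literal that keeps positives and drops negatives
            best = max(candidate_literals,
                       key=lambda lit: sum(lit(e) for e in pos) - sum(lit(e) for e in neg))
            new_pos = {e for e in pos if best(e)}
            new_neg = {e for e in neg if best(e)}
            if not new_pos or len(new_neg) == len(neg):
                break                     # no useful specialization found
            body.append(best)
            pos, neg = new_pos, new_neg
        if neg or not pos:
            break                         # could not build a consistent clause
        theory.append(body)
        remaining -= pos                  # "separate": drop the covered positives
    return theory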
21
Description of FOIL (continued)
Initialization: we illustrate the process using
the member(E,L) relation. The initial clause
consists of just the head literal:
  member(A,B)    (where A is the element and B the list)
The set of examples corresponding to this initial
partial clause is just all possible positive and
negative elements of the relation member(A,B).
All 10 positive examples:
  <1,[1]>(+) <2,[2]>(+) <3,[3]>(+) <1,[1,2]>(+)
  <2,[1,2]>(+) <2,[2,3]>(+) <3,[2,3]>(+)
  <1,[1,2,3]>(+) <2,[1,2,3]>(+) <3,[1,2,3]>(+)
Some negative examples:
  <1,[]>(-) <1,[2]>(-) <1,[3]>(-) <1,[2,3]>(-)
  <2,[]>(-)
22
Description of FOIL (continued)
The literal components(B,A,C), an instance of the
background relation components(L,H,T), is now
added to the clause body, giving the intermediate
theory:
  member(A,B) :- components(B,A,C)
The new clause has three variables and is
satisfied by the following elements <A,B,C>:
  <1,[1],[]>(+) <2,[2],[]>(+) <3,[3],[]>(+)
  <1,[1,2],[2]>(+) <2,[2,3],[3]>(+)
  <1,[1,2,3],[2,3]>(+)
For instance, <1,[1]> is removed from remaining
because the clause is satisfied with A=1, B=[1],
C=[]. In other words, if an element is the head H
of a list L, it is a member of L. Only 4 positive
examples are not covered by the clause
member(A,B) :- components(B,A,C):
  <2,[1,2]>(+) <3,[2,3]>(+) <2,[1,2,3]>(+)
  <3,[1,2,3]>(+)
23
Description of FOIL (continued)
Adding a further literal gives a second clause,
so the partial definition is now:
  member(A,B) :- components(B,A,C)
  member(A,B) :- components(B,C,D), member(A,D)
The second clause covers the remaining 4 positive
examples (shown as <A,B,C,D> tuples):
  <2,[1,2],1,[2]>(+) <3,[2,3],2,[3]>(+)
  <2,[1,2,3],1,[2,3]>(+) <3,[1,2,3],1,[2,3]>(+)
Every example of the relation member(E,L) is now
covered by one of the two clauses, so the
definition of member(E,L) is complete and can be
used for elements other than 1, 2, and 3. For
example, member(4,[1,2,3,4]) succeeds via:
  member(4,[1,2,3,4]) :- components([1,2,3,4],1,[2,3,4]), member(4,[2,3,4])
  member(4,[2,3,4])   :- components([2,3,4],2,[3,4]),     member(4,[3,4])
  member(4,[3,4])     :- components([3,4],3,[4]),         member(4,[4])
  member(4,[4])       :- components([4],4,[])
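As a quick executable check of this reasoning, with Python lists standing in
for the Prolog-style terms (components(L,H,T) simply splits a non-empty list
into head and tail):

def components(l):
    # components(L,H,T): L is a non-empty list with head H and tail T
    return (l[0], l[1:]) if l else None

def member(e, l):
    c = components(l)
    if c is None:
        return False
    h, t = c
    return e == h or member(e, t)   # clause 1: e is the head; clause 2: recurse on the tail

assert member(4, [1, 2, 3, 4])      # the worked example above
assert not member(5, [1, 2, 3])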
24
First-Order Text Classification
  • Quinlan's FOIL algorithm for WebKB
  • A greedy algorithm for learning function-free
    clauses.
  • Background relations (stemmed words with 200
    occurrences)
  • has_word(Page) indicates which words occur in
    which pages.
  • link_to(Page, Page) represents the hyperlinks
    that interconnect the pages in the data set.

For every class, the m-estimate of each FOIL
rule's accuracy was calculated to determine the
winning class for a document d (a small numeric
check follows the definitions below):
  m-estimate = (nc + m·p) / (n + m)
  • nc is the number of instances correctly
    classified by the rule
  • n is the total number of instances classified
    by the rule
  • p is a prior estimate of the rule's accuracy
  • m is a constant called the equivalent sample
    size, which determines how heavily p is
    weighted relative to the observed data (m = 2)
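A direct transcription of this estimate in Python; taking the prior p of a
K-class problem as 1/K is a common default and an assumption here, not
something stated on the slide.

def m_estimate(nc, n, p, m=2):
    # m-estimate of rule accuracy: (nc + m*p) / (n + m)
    return (nc + m * p) / (n + m)

# e.g. a rule matching 10 documents, 9 of them correctly, with prior p = 0.25:
# m_estimate(9, 10, 0.25)  ->  (9 + 2*0.25) / (10 + 2) = 0.79...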

25
A few of the rules learned by FOIL
  • For the class course(A), the FOIL algorithm
    learned a rule that can be paraphrased as
    follows:
  • The page contains the word instructor, but not
    the word good.
  • The page has a link to another page that
    contains no outgoing links.
  • That linked page contains the word assign.

26
Experimental Evaluation
27
Combining Learners
  • Method for combining the predictions of the classifiers
  • Simple voting scheme among all four classifiers
    (the class receiving the majority of the votes
    made by the individual classifiers wins)
  • In case of a tie, the confidence level is used
    as a tie-breaker
  • To ensure that the classifiers' scores are
    comparable
  • Calibrate each classifier by inducing a mapping
    from its output scores to the probability of a
    prediction being correct
  • Partition the scores produced by each
    classifier into bins and measure the
    training-set accuracy of the scores that fall
    into each bin (see the sketch after this list)
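A minimal sketch of that calibration step in Python: map a classifier's raw
score to an estimated probability of being correct by binning the training-set
scores and measuring per-bin accuracy. Equal-width bins and the bin count are
illustrative assumptions; the slide does not specify them.

import bisect

def fit_calibration(scores, correct, n_bins=10):
    # scores: training-set scores; correct: matching booleans (prediction was right)
    lo, hi = min(scores), max(scores)
    edges = [lo + (hi - lo) * i / n_bins for i in range(1, n_bins)]
    hits, totals = [0] * n_bins, [0] * n_bins
    for s, ok in zip(scores, correct):
        b = bisect.bisect_right(edges, s)
        totals[b] += 1
        hits[b] += ok
    acc = [h / t if t else 0.0 for h, t in zip(hits, totals)]
    return edges, acc

def calibrated(score, edges, acc):
    # estimated probability that a prediction with this score is correct
    return acc[bisect.bisect_right(edges, score)]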

28
Experimental Evaluation
29
Experimental Evaluation
30
Identifying Multi-Page Segments
  • Goal: develop methods for identifying sets of
    interlinked pages that together represent a
    single knowledge base instance
  • Prior assumption: one page = one instance (a
    primary page plus others)
  • New approach: group related pages together,
    using regularities in URL structures
  • Identify the most representative page of a
    group (for example, a URL naming pattern that
    identifies a person entity)
  • The main page can be identified by file names
    such as index, home, cs???

31
The URL Grouping Algorithm (Appendix C)
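The grouping algorithm itself (Appendix C) is not reproduced in the
transcript. The following is only a minimal sketch of the idea stated on the
previous slide: group pages that share a URL directory and pick an index/home
page as the representative. All names and heuristics here are illustrative
assumptions, not the paper's algorithm.

from collections import defaultdict
from urllib.parse import urlparse
import posixpath

def group_by_directory(urls):
    # group pages that live under the same host + directory
    groups = defaultdict(list)
    for url in urls:
        parsed = urlparse(url)
        groups[(parsed.netloc, posixpath.dirname(parsed.path))].append(url)
    return groups

def primary_page(group):
    # prefer an index/home page as the group's representative, else the shortest URL
    for url in group:
        name = posixpath.basename(urlparse(url).path).lower()
        if name.startswith(("index", "home")):
            return url
    return min(group, key=len)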
32
Experimental Evaluation
33
Future work
  • Methods for document classification
  • Bayesian learning: Minimum Description Length
    (MDL)
  • Symbolic learning: decision trees
  • k-NN (nearest neighbor) algorithm