Learning to construct knowledge bases from the world wide web

1
Learning to construct knowledge bases from the
world wide web
  • Yang Lu
  • luyang_at_cse.msu.edu
  • Department of Computer Science and Engineering
  • Michigan State University
  • Advisor: Dr. Sakti Pramanik

2
Learning to construct knowledge bases from the
world wide web
  • Goal of the project
  • Automatically create a computer-understandable
    knowledge base whose content mirrors that of the
    world wide web.

3
The problem
  • A huge number of web pages.
  • Classifying the web pages.
  • Finding relations among the web pages.
  • Making the content machine-understandable.

4
An example
  • Dr. Pramanik's web page

5
How to construct and maintain the knowledge base
  • Use machine learning to create information
    extraction methods for each of the desired types
    of knowledge.
  • Apply the learned information extraction
    methods to extract symbolic, probabilistic
    statements directly from web hypertext.

6
The WebKB system overview
  • The system is first trained to extract
    information of the desired types, then it is
    allowed to browse new web sites to extract a new
    knowledge base.

7
Training the WebKB system
  • The training data
  • A specification of the classes and the relations
    of interest.
  • Training examples that describe instances of the
    ontology classes and relations.

8
Example of training data
  • Classes: person, student, course,
    research_project
  • Relations:
  • advisor_of(student, faculty)
  • The samples are hand-labeled.
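
A minimal sketch of how such training data could be held in memory. The `Ontology` and `TrainingSet` names and the example.edu URLs are hypothetical, and `faculty` is added to the class list only because the advisor_of relation refers to it.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    classes: set
    relations: dict  # relation name -> tuple of argument class names


@dataclass
class TrainingSet:
    ontology: Ontology
    class_labels: dict = field(default_factory=dict)     # url -> class name
    relation_labels: list = field(default_factory=list)  # (relation, url_a, url_b)

    def label_page(self, url, cls):
        # Hand-labeling a page: only classes from the ontology are allowed.
        if cls not in self.ontology.classes:
            raise ValueError(f"unknown class: {cls}")
        self.class_labels[url] = cls

    def label_relation(self, name, url_a, url_b):
        # A relation example must match the argument classes in the ontology.
        expected = self.ontology.relations[name]
        actual = (self.class_labels[url_a], self.class_labels[url_b])
        if actual != expected:
            raise ValueError(f"{name} expects {expected}, got {actual}")
        self.relation_labels.append((name, url_a, url_b))


ontology = Ontology(
    classes={"person", "student", "course", "research_project", "faculty"},
    relations={"advisor_of": ("student", "faculty")},
)
training = TrainingSet(ontology)
training.label_page("http://example.edu/~jane", "student")   # hypothetical URL
training.label_page("http://example.edu/~smith", "faculty")  # hypothetical URL
training.label_relation(
    "advisor_of", "http://example.edu/~jane", "http://example.edu/~smith"
)
```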

9
Assumptions due to the varied nature of the web
  • How class instances are described on the web
  • Each instance of an ontology class is
    represented by one or more segments of hypertext
    on the web (a single web page, a contiguous string
    of text within a page, or a rooted graph of several
    pages linked by hyperlinks).
  • How relation instances are described on the web
  • An undirected path of hyperlinks
  • A segment of text representing A that contains
    the segment that represents B
  • The hypertext segment for A satisfies some learned
    model for relatedness to B.

10
An example from the paper

11
Recognize class instance
  • Statistical bag-of-words method
  • Ignore the sequence in which the words occur.
  • Assume each word occurs independently of the
    others.

12
Recognize class instance
  • Estimate the word probabilities
  • The estimates need to be robust for infrequent
    words.
  • Choose the vocabulary
  • Limit the vocabulary size to 2000 words.
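
The bag-of-words approach can be sketched as a naive Bayes classifier. This is a minimal sketch under stated assumptions: add-one (Laplace) smoothing stands in for whatever robust estimate the system uses, and keeping the most frequent words stands in for its vocabulary selection. The function names are illustrative only.

```python
import math
from collections import Counter


def train_naive_bayes(docs, max_vocab=2000):
    """docs: list of (tokens, label) pairs. Returns (vocab, priors, word_probs)."""
    totals = Counter()
    for tokens, _ in docs:
        totals.update(tokens)
    # Assumption: keep the max_vocab most frequent words.
    vocab = {w for w, _ in totals.most_common(max_vocab)}

    labels = sorted({label for _, label in docs})
    counts = {c: Counter() for c in labels}
    n_docs = Counter()
    for tokens, label in docs:
        n_docs[label] += 1
        counts[label].update(t for t in tokens if t in vocab)

    priors = {c: n_docs[c] / len(docs) for c in labels}
    word_probs = {}
    for c in labels:
        # Add-one smoothing keeps rare or unseen words from getting probability 0.
        denom = sum(counts[c].values()) + len(vocab)
        word_probs[c] = {w: (counts[c][w] + 1) / denom for w in vocab}
    return vocab, priors, word_probs


def classify(tokens, vocab, priors, word_probs):
    """Return the class with the highest log-posterior; word order is ignored."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for t in tokens:
            if t in vocab:
                score += math.log(word_probs[c][t])
        scores[c] = score
    return max(scores, key=scores.get)


docs = [
    ("textbook ta homework".split(), "course"),
    ("advisor thesis research".split(), "student"),
]
model = train_naive_bayes(docs)
predicted = classify(["textbook", "homework"], *model)
```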

13
Recognize class instance
  • Results

14
Recognize class instance
  • Results
  • Top vocabulary list for each class

15
Recognize class instance
  • Results
  • Top vocabulary list for each class

16
Recognize class instance
  • Results
  • Accuracy/coverage of full text

17
Recognize class instance
  • Results
  • Accuracy/coverage of hyperlink

18
Recognize class instance
  • Results
  • Accuracy/coverage of title/heading

19
Recognize class instance
  • First-order text classification
  • Learn to classify web pages using a learner
    that can induce first-order rules.
  • Uses the FOIL algorithm.
  • Example: a page is a course homepage if it
    contains the words "textbook" and "TA" and is linked
    to a page that contains the word "assignment".
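
The example rule can be written as a predicate over pages and their hyperlinks. This is a hand-coded illustration of what such a rule tests, not output from FOIL; the `Page` type and the lowercased word sets are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    words: set                                 # lowercased words on the page
    links: list = field(default_factory=list)  # URLs this page links to


def is_course_homepage(page, pages_by_url):
    """True if the page has 'textbook' and 'ta' and links to a page with 'assignment'."""
    if not {"textbook", "ta"} <= page.words:
        return False
    return any(
        "assignment" in pages_by_url[url].words
        for url in page.links
        if url in pages_by_url
    )


# Hypothetical pages for illustration.
pages = {
    "course": Page("course", {"textbook", "ta", "syllabus"}, ["hw"]),
    "hw": Page("hw", {"assignment", "due"}),
    "person": Page("person", {"textbook", "research"}, ["hw"]),
}
```

Unlike the bag-of-words model, the rule relates two pages through a link, which is what makes it first-order.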

20
Recognize class instance
  • First-order text classification
  • Apply FOIL 6.4 with default settings to learn
    a set of clauses for each of the classes mentioned
    above except other.
  • other is the default class.
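
FOIL builds each clause greedily, adding the literal with the highest information gain. A minimal sketch of that standard gain measure follows; it comes from the general FOIL literature rather than from these slides, and the argument names are illustrative.

```python
import math


def foil_gain(p0, n0, p1, n1, t):
    """Information gain of adding one literal to a clause.

    p0, n0: positive and negative bindings covered before adding the literal.
    p1, n1: positive and negative bindings covered after adding it.
    t: positive bindings covered before that are still covered after.
    """
    if p0 <= 0 or p1 <= 0:
        raise ValueError("the clause must cover at least one positive binding")
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return t * (after - before)


# A literal that keeps 8 of 10 positives and drops 8 of 10 negatives.
gain = foil_gain(p0=10, n0=10, p1=8, n1=2, t=8)
```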

21
Recognize class instance
  • First-order text classification
  • Learned rule classifier example
  • Explanation of the rule
  • The page has the word instructor, but does not
    have the word good.
  • The page contains a hyperlink to a page which
    does not contain any hyperlink to other pages.
  • This linked page contains the word assign.

22
Recognize class instance
  • Combine the statistical classifier and the
    first-order rule-based classifier
  • Using a simple voting algorithm

23
Recognize relation instance
  • Recognize relations of interest that exist among
    extracted class instances.
  • Relations among class instances are represented by
    hyperlink paths on the web.

24
Recognize relation instance
  • Use the FOIL algorithm to derive the relations.
  • Two-step process:
  • Path finding
  • Rule finding

25
Recognize relation instance
  • Path finding process: an example
  • Rule finding process: FOIL algorithm

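The path finding step can be illustrated with a breadth-first search that treats hyperlinks as undirected edges and returns a shortest path between two pages. This is a generic stand-in, not the system's actual procedure; `find_path` and the `max_len` bound are assumptions.

```python
from collections import deque


def find_path(links, start, goal, max_len=3):
    """links: dict mapping a URL to the URLs it links to.

    Returns a shortest undirected path from start to goal with at most
    max_len hyperlinks, or None if there is none.
    """
    # Build an undirected neighbor map from the directed hyperlinks.
    neighbors = {}
    for src, targets in links.items():
        for dst in targets:
            neighbors.setdefault(src, set()).add(dst)
            neighbors.setdefault(dst, set()).add(src)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) - 1 >= max_len:
            continue  # do not extend paths beyond max_len links
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical site: a student and a faculty member both link to a group page.
links = {"student": ["group"], "faculty": ["group"]}
path = find_path(links, "student", "faculty")
```

The rule finding step would then learn which properties of the pages along such paths indicate the relation.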
26
Extract text fields
  • Using the SRV algorithm
    (Sequence Rules with Validation)
  • SRV is a first-order rule learner based on FOIL.
  • Input is a set of web pages.
  • Output is a set of information extraction rules.
  • Works on the HTML source and tries to match a
    sequence of tokens.

27
Conclusion
  • Advantages
  • Automatic extraction.
  • Good performance.
  • Disadvantages
  • Needs training samples.
  • The algorithm needs improvement.

28
Unsupervised information extraction
  • Soft matching patterns
  • Represent patterns by:
  • Combining lexical tokens with
    part-of-speech classes and punctuation.
  • Adopting a probabilistic framework that combines
    slot content and sequential fidelity in computing
    the degree of pattern match.

29
Unsupervised information extraction
  • Tokens for the soft matching patterns

30
Unsupervised information extraction
  • Example of generalizing soft patterns

31
Unsupervised information extraction
  • Pattern matching algorithm
  • Cosine similarity.
  • Express the soft pattern as follows:
  • <Slot_-w, ..., Slot_-2, Slot_-1, SCH_TERM, Slot_1,
    Slot_2, ..., Slot_w : Pa>
  • <(token_i1, weight_i1), (token_i2, weight_i2), ...,
    (token_im, weight_im) : Slot_i>
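
Cosine similarity between two slots can be sketched by treating each slot's (token, weight) pairs as a sparse vector. The dictionary representation and the sample tokens are assumptions for illustration; how the slides' algorithm combines slot scores is not shown here.

```python
import math


def cosine_similarity(slot_a, slot_b):
    """slot_a, slot_b: dicts mapping a token to its weight."""
    dot = sum(w * slot_b.get(tok, 0.0) for tok, w in slot_a.items())
    norm_a = math.sqrt(sum(w * w for w in slot_a.values()))
    norm_b = math.sqrt(sum(w * w for w in slot_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty slot matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical slots mixing lexical tokens and part-of-speech classes.
slot_1 = {"is": 1.0, "NN": 1.0}
slot_2 = {"NN": 1.0}
similarity = cosine_similarity(slot_1, slot_2)
```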