Learning to construct knowledge bases from the World Wide Web

1
Learning to construct knowledge bases from the
World Wide Web

Yang Lu
luyang_at_cse.msu.edu
Department of Computer Science and Engineering
Michigan State University
Advisor: Dr. Sakti Pramanik

2
Learning to construct knowledge bases from the
World Wide Web

Goal of the project
Automatically create a computer-understandable
knowledge base whose content mirrors that of the
World Wide Web.

3
The problem

A huge number of web pages.
Web pages must be classified.
Relations among the web pages must be identified.
The result must be machine-understandable.

4
An example

Dr. Pramanik's web page

5
How to construct and maintain the knowledge base

Using machine learning to create information
extraction methods for each of the desired types
of knowledge.
Applying the learned information extraction
methods to extract symbolic, probabilistic
statements directly from the web hypertext.

6
The WebKB system overview

The system is first trained to extract
information of the desired types, then it is
allowed to browse new web sites to extract a new
knowledge base.

7
Training the WebKB system

The training data
A specification of the classes and the relations
of interest.
Training examples that describe instances of the
ontology classes and relations.

8
Example of training data

Classes: person, student, course,
research_project
Relations:
advisor_of(student, faculty)
The samples are hand-labeled.

9
Assumptions due to the varied nature of the web

How class instances are described on the web:
Each instance of an ontology class is
represented by one or more segments of hypertext
on the web (a single web page, a contiguous span of text
within a page, or a rooted graph of several pages
linked by hyperlinks).
How relation instances are described on the web:
An undirected path of hyperlinks connects the
segment representing A to the segment representing B.
A segment of text representing A contains
the segment that represents B.
The hypertext segment for A satisfies some learned
model for relatedness to B.

10
An example from the paper

11
Recognize class instance

Statistical bag-of-words method
Ignore the sequence in which the words occur.
Assume each word occurs independently of the others.
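
A minimal sketch of the bag-of-words idea, not code from the presentation. The tokenizer (lowercased alphanumeric runs) and the function name bag_of_words are assumptions made for illustration. It shows that two texts with the same words in a different order get the same representation.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a text to word counts, discarding the order of the words."""
    # Assumed tokenizer: lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


first = bag_of_words("The TA grades the assignment")
second = bag_of_words("assignment the grades TA the")

# Same words, different order: the two bags are identical.
same_bag = first == second
```

Because only counts survive, any classifier built on this representation cannot tell the two sentences apart.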

12
Recognize class instance

Estimate the word probability
The estimate needs to be robust for infrequent words.
Estimate the vocabulary
Limit the vocabulary size to 2000 words.
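
A hedged sketch of these two steps. The slides do not say which smoothing or which vocabulary selection criterion was used, so Laplace (add-one) smoothing and selection by raw frequency are assumptions swapped in here. The function names and the toy training data are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(docs_by_class, max_size=2000):
    """Keep the max_size most frequent words (assumed selection criterion)."""
    totals = Counter()
    for docs in docs_by_class.values():
        for doc in docs:
            totals.update(doc)
    return {word for word, _ in totals.most_common(max_size)}


def word_probabilities(docs, vocabulary):
    """Laplace-smoothed P(word | class), so rare words never get probability 0."""
    counts = Counter()
    for doc in docs:
        counts.update(w for w in doc if w in vocabulary)
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocabulary)) for w in vocabulary}


def log_score(doc, probs, prior):
    """log P(class) plus the sum of log P(word | class) over in-vocabulary words."""
    return math.log(prior) + sum(math.log(probs[w]) for w in doc if w in probs)


# Toy, hand-made training data (documents are lists of tokens).
docs_by_class = {
    "course": [["textbook", "ta", "assignment"], ["ta", "homework"]],
    "student": [["advisor", "research"]],
}
vocabulary = build_vocabulary(docs_by_class)
course_probs = word_probabilities(docs_by_class["course"], vocabulary)
student_probs = word_probabilities(docs_by_class["student"], vocabulary)

new_page = ["ta", "textbook"]
course_score = log_score(new_page, course_probs, prior=2 / 3)
student_score = log_score(new_page, student_probs, prior=1 / 3)
```

With add-one smoothing, a word never seen with a class (such as "advisor" for course) still receives a small nonzero probability, which is the robustness the slide asks for.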

13
Recognize class instance

Results

14
Recognize class instance

Results
Top vocabulary list for each class

15
Recognize class instance

Results
Top vocabulary list for each class

16
Recognize class instance

Results
Accuracy/coverage of full text

17
Recognize class instance

Results
Accuracy/coverage of hyperlink

18
Recognize class instance

Results
Accuracy/coverage of title/heading

19
Recognize class instance

First-order text classification
Learn to classify web pages with a learner
that is able to induce first-order rules,
using the FOIL algorithm.
Example: A page is a course homepage if it
contains the words "textbook" and "TA" and is linked
to a page that contains the word "assignment".
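
The example rule can be sketched as a check over a small page graph. This only shows how such a rule is applied, not how FOIL learns it. The page identifiers, the dictionary layout, and the function name are hypothetical.

```python
# Toy page graph: each page has a set of words and a set of outgoing links.
pages = {
    "cs101": {"words": {"textbook", "ta", "lectures"}, "links": {"cs101_hw"}},
    "cs101_hw": {"words": {"assignment", "due"}, "links": set()},
    "alice": {"words": {"research", "advisor"}, "links": {"cs101"}},
}


def is_course_homepage(page_id, pages):
    """Rule: has 'textbook' and 'ta', and links to a page with 'assignment'."""
    page = pages[page_id]
    if not {"textbook", "ta"} <= page["words"]:
        return False
    return any(
        "assignment" in pages[target]["words"]
        for target in page["links"]
        if target in pages
    )


course_pages = [pid for pid in sorted(pages) if is_course_homepage(pid, pages)]
```

The rule is first-order because it quantifies over a second page (the link target), which a plain bag-of-words classifier cannot express.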

20
Recognize class instance

First-order text classification
Apply FOIL 6.4 with default settings to learn
a set of clauses for each of the classes mentioned
above except "other".
"other" is the default class.

21
Recognize class instance

First-order text classification
Learned rule classifier example

Explanation of the rule
The page has the word instructor, but does not
have the word good.
The page contains a hyperlink to a page which
does not contain any hyperlink to other pages.
This linked page contains the word assign.
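
One possible reading of this explanation as code, offered as a sketch only: the original rule is not reproduced in the transcript, so the function below is reconstructed from the three explanatory sentences above, and the page data and names are hypothetical.

```python
# Toy page graph (hypothetical): words on each page and its outgoing links.
pages = {
    "course_home": {"words": {"instructor", "syllabus"}, "links": {"hw"}},
    "hw": {"words": {"assign", "due"}, "links": set()},
    "review": {"words": {"instructor", "good"}, "links": {"hw"}},
}


def matches_learned_rule(page_id, pages):
    """True if the page satisfies all conditions described in the explanation."""
    page = pages[page_id]
    # Condition 1: contains "instructor" but not "good".
    if "instructor" not in page["words"] or "good" in page["words"]:
        return False
    # Conditions 2 and 3: links to a page with no outgoing links
    # that contains the word "assign".
    for target in page["links"]:
        linked = pages.get(target)
        if linked and not linked["links"] and "assign" in linked["words"]:
            return True
    return False


matched = [pid for pid in sorted(pages) if matches_learned_rule(pid, pages)]
```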

22
Recognize class instance

Combine the statistical classifier and the
first-order rule-based classifier
using a simple voting algorithm.
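
A minimal sketch of one simple voting scheme. The slides do not spell out the exact scheme, so the tie-breaking rule used here (fall back to the single most confident prediction when there is no strict majority) is an assumption, as are the function name and the confidence values.

```python
from collections import Counter


def vote(predictions):
    """Combine (label, confidence) pairs from several classifiers.

    A label with strictly more votes than any other wins; otherwise the
    single most confident prediction is used (assumed tie-breaking rule).
    """
    counts = Counter(label for label, _ in predictions)
    ranked = counts.most_common()
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    return max(predictions, key=lambda p: p[1])[0]


# Two classifiers agree, so their label wins despite a confident dissenter.
majority = vote([("course", 0.6), ("course", 0.9), ("student", 0.95)])

# Two classifiers disagree, so the more confident one decides.
tie_break = vote([("course", 0.6), ("student", 0.8)])
```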

23
Recognize relation instance

Recognize relations of interest that exist among
extracted class instances.
Relations among class instances are represented by
hyperlink paths on the web.

24
Recognize relation instance

Use the FOIL algorithm to derive the relations.
Two-step process:
Path finding
Rule finding

25
Recognize relation instance

Path finding process: an example
Rule finding process: the FOIL algorithm
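
The path finding step can be sketched as a breadth-first search that treats hyperlinks as undirected edges. This is an illustration under that assumption, not the procedure used in the system; the page names and the function find_path are hypothetical.

```python
from collections import deque


def find_path(links, start, goal):
    """Shortest path between two pages, treating hyperlinks as undirected."""
    neighbors = {}
    for src, dst in links:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Sorted so the result is deterministic when several paths tie.
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Directed hyperlinks (source page, target page).
links = [
    ("dept", "faculty_list"),
    ("faculty_list", "prof"),
    ("student", "prof_group"),
    ("prof", "prof_group"),
]
path = find_path(links, "student", "prof")
```

The student and professor pages never link to each other directly, but both link to the group page, so an undirected path connects them. Paths like this would then be handed to the rule finding step.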

26
Extract text fields

Using the SRV algorithm
(Sequence Rules with Validation)
SRV is a first-order rule learner based on FOIL.
Input is a set of web pages.
Output is a set of information extraction rules.
It works on the HTML source and tries to match a
sequence of tokens.
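
A rough sketch of what matching a sequence of tokens against HTML source can look like. It does not implement SRV or its rule learning; the tokenizer, the token predicates, and the course-code pattern are all assumptions chosen for illustration.

```python
import re


def tokenize(html):
    """Split HTML source into tag tokens, word tokens, and punctuation tokens."""
    return re.findall(r"</?\w+[^>]*>|\w+|[^\w\s]", html)


def find_matches(tokens, predicates):
    """Return every token window where each token satisfies its predicate."""
    n = len(predicates)
    return [
        tokens[i:i + n]
        for i in range(len(tokens) - n + 1)
        if all(pred(tok) for pred, tok in zip(predicates, tokens[i:i + n]))
    ]


def is_upper_word(token):
    return token.isalpha() and token.isupper()


def is_number(token):
    return token.isdigit()


# Hypothetical extraction pattern: an all-caps word followed by a number.
course_code_pattern = [is_upper_word, is_number]

html = "<h1>CSE 802 Pattern Recognition</h1><p>TA hours at 3 pm</p>"
tokens = tokenize(html)
matches = find_matches(tokens, course_code_pattern)
```

Here the pattern picks out the course code and skips "TA hours" and "at 3", since neither pair satisfies both predicates.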

27
Conclusion

Advantages
Automatic extraction.
Good performance.
Disadvantages
Needs training samples.
The algorithm needs improvement.

28
Unsupervised information extraction

Soft matching patterns
Represent patterns by:
Combining lexical tokens with
part-of-speech classes and punctuation.
Adopting a probabilistic framework that combines
slot content and sequential fidelity in computing
the degree of pattern match.
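
A loose illustration of the idea, with every detail assumed: the slides give no formulas, so this sketch scores slot content with smoothed per-position token frequencies and sequential fidelity with smoothed bigram frequencies, then multiplies the two. The function names and the training sequences are hypothetical.

```python
from collections import Counter


def train_soft_pattern(instances):
    """Collect per-slot and bigram counts from equal-length token sequences.

    Tokens may be words, part-of-speech classes, or punctuation marks.
    """
    length = len(instances[0])
    slot_counts = [Counter(seq[i] for seq in instances) for i in range(length)]
    bigram_counts = Counter(
        (seq[i], seq[i + 1]) for seq in instances for i in range(length - 1)
    )
    left_counts = Counter(tok for seq in instances for tok in seq[:-1])
    vocab = {tok for seq in instances for tok in seq}
    return slot_counts, bigram_counts, left_counts, vocab, len(instances)


def match_degree(candidate, model):
    """Slot-content score times sequence score, both add-one smoothed."""
    slot_counts, bigram_counts, left_counts, vocab, n = model
    v = len(vocab) + 1  # one extra bucket for unseen tokens
    slot_score = 1.0
    for i, tok in enumerate(candidate):
        slot_score *= (slot_counts[i][tok] + 1) / (n + v)
    seq_score = 1.0
    for a, b in zip(candidate, candidate[1:]):
        seq_score *= (bigram_counts[(a, b)] + 1) / (left_counts[a] + v)
    return slot_score * seq_score


# Hypothetical training sequences mixing POS classes, words, and punctuation.
instances = [
    ["NNP", ",", "a", "NN"],
    ["NNP", ",", "the", "NN"],
    ["NNP", "is", "a", "NN"],
]
model = train_soft_pattern(instances)
good = match_degree(["NNP", ",", "a", "NN"], model)
scrambled = match_degree(["NN", "a", ",", "NNP"], model)
```

A candidate with the right tokens in the wrong order scores lower on both factors, which is what makes the match soft rather than all-or-nothing.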

29
Unsupervised information extraction