Title: Learning to construct knowledge bases from the World Wide Web
1Learning to construct knowledge bases from the
World Wide Web
- Yang Lu
- luyang_at_cse.msu.edu
- Department of Computer Science and Engineering
- Michigan State University
- Advisor: Dr. Sakti Pramanik
2Learning to construct knowledge bases from the
World Wide Web
- Goal of the project
- Automatically create a computer-understandable
knowledge base whose content mirrors that of the
World Wide Web.
3The problem
- Huge number of web pages.
- Classification of web pages.
- Relations among the web pages.
- The result must be machine-understandable.
4An example
5How to construct and maintain the knowledge base
- Using machine learning to create information
extraction methods for each of the desired types
of knowledge.
- Applying the learned information extraction
methods to extract symbolic, probabilistic
statements directly from the web hypertext.
6The WebKB system overview
- The system is first trained to extract
information of the desired types, then it is
allowed to browse new web sites to extract a new
knowledge base.
7Training the WebKB system
- The training data
- A specification of the classes and the relations
of interest.
- Training examples that describe instances of the
ontology classes and relations.
8Example of training data
- Classes: person, student, course,
research_project
- Relations
- advisor_of(student, faculty)
- The samples are hand-labeled.
9Assumptions due to the varied nature of the web
- How class instances are described on the web
- Each instance of an ontology class is
represented by one or more segments of hypertext
on the web (a single web page, a contiguous string of text
within a page, or a rooted graph of several pages
linked by hyperlinks).
- How relation instances (between A and B) are described on the web
- An undirected path of hyperlinks.
- A segment of text representing A that contains
the segment that represents B.
- The hypertext segment for A satisfies some learned
model for relatedness to B.
10An example from the paper
11Recognize class instance
- Statistical bag-of-words method
- Ignore the sequence in which the words occur.
- Assume each word occurs independently of the others.
12Recognize class instance
- Estimate the word probabilities
- The estimates need to be robust for infrequent words.
- Choose the vocabulary
- Limit the vocabulary size to 2000 words.
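A minimal sketch of the bag-of-words idea on the last two slides, written as a small naive Bayes classifier. Two choices here are assumptions and are not taken from the slides: the vocabulary is limited by keeping the most frequent words, and add-one (Laplace) smoothing is used so that infrequent or unseen words never receive zero probability. The function names `train` and `classify` are hypothetical.

```python
import math
from collections import Counter


def train(docs, vocab_size=2000):
    """Fit a naive Bayes model. docs is a list of (class_label, list_of_words)."""
    # Assumption: limit the vocabulary by overall word frequency.
    totals = Counter(w for _, words in docs for w in words)
    vocab = {w for w, _ in totals.most_common(vocab_size)}

    word_counts = {}          # class label -> Counter of in-vocabulary words
    class_docs = Counter()    # class label -> number of training documents
    for label, words in docs:
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(
            w for w in words if w in vocab
        )

    model = {}
    for label, counts in word_counts.items():
        # Assumption: add-one smoothing keeps rare words from getting zero mass.
        denom = sum(counts.values()) + len(vocab)
        model[label] = {
            "prior": math.log(class_docs[label] / len(docs)),
            "logp": {w: math.log((counts[w] + 1) / denom) for w in vocab},
        }
    return model, vocab


def classify(model, vocab, words):
    """Return the most probable class; word order is ignored."""
    scores = {
        label: m["prior"] + sum(m["logp"][w] for w in words if w in vocab)
        for label, m in model.items()
    }
    return max(scores, key=scores.get)
```

Because the score is a sum over words, reordering the words of a page does not change the predicted class, which is exactly the "ignore the sequence" assumption.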
13Recognize class instance
14Recognize class instance
- Results
- Top vocabulary list for each class
15Recognize class instance
- Results
- Top vocabulary list for each class
16Recognize class instance
- Results
- Accuracy/coverage of full text
17Recognize class instance
- Results
- Accuracy/coverage of hyperlink
18Recognize class instance
- Results
- Accuracy/coverage of title/heading
19Recognize class instance
- First-order text classification
- Learn to classify web pages with a learner
that is able to induce first-order rules.
- Using the FOIL algorithm.
- Example: A page is a course homepage if it
contains the words textbook and TA and is linked
to a page that contains the word assignment.
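To make the example rule concrete, here is a hand-written version of it as a Python predicate. It only illustrates what a first-order rule can express (a condition on the page itself plus a condition on a linked page); it is not output from FOIL. The page representation (a dict from URL to lowercased word set and outgoing links) and the name `is_course_homepage` are assumptions made for this sketch.

```python
def is_course_homepage(url, pages):
    """Apply the example rule to one page.

    pages maps a URL to {"words": set of lowercased words, "links": list of URLs}.
    """
    page = pages[url]
    # Condition on the page itself: it contains both "textbook" and "ta".
    if not {"textbook", "ta"} <= page["words"]:
        return False
    # Condition on a related page: some linked page contains "assignment".
    return any(
        "assignment" in pages[target]["words"]
        for target in page["links"]
        if target in pages
    )
```

A bag-of-words classifier cannot express the second condition, because it looks at one page in isolation.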
20Recognize class instance
- First-order text classification
- Apply FOIL 6.4 with default settings to learn
a set of clauses for each of the classes mentioned
above except other.
- other is the default class.
21Recognize class instance
- First-order text classification
- Learned rule classifier example
- Explanation of the rule
- The page has the word instructor, but does not
have the word good.
- The page contains a hyperlink to a page which
does not contain any hyperlinks to other pages.
- This linked page contains the word assign.
22Recognize class instance
- Combine the statistical classifier and the
first-order rule-based classifier.
- Using a simple voting algorithm.
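The slide does not spell out the voting scheme, so the following is only one plausible sketch: each classifier reports a label and a confidence, the label with the most votes wins, and ties are broken by the highest confidence. The `(label, confidence)` interface and the function name `vote` are assumptions.

```python
from collections import defaultdict


def vote(predictions):
    """Combine classifier outputs.

    predictions is a list of (label, confidence) pairs, one per classifier.
    """
    votes = defaultdict(int)
    best_confidence = defaultdict(float)
    for label, confidence in predictions:
        votes[label] += 1
        best_confidence[label] = max(best_confidence[label], confidence)
    # Most votes first; among equally voted labels, the most confident one.
    return max(votes, key=lambda label: (votes[label], best_confidence[label]))
```

With only two classifiers, this reduces to taking the shared label when they agree and the more confident prediction when they disagree.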
23Recognize relation instance
- Recognize relations of interest that exist among
extracted class instances.
- Relations among class instances are represented by
hyperlink paths in the web.
24Recognize relation instance
- Use the FOIL algorithm to learn rules for the relations.
- Two-step process
- Path finding
- Rule finding
25Recognize relation instance
- Path-finding process: an example
- Rule-finding process: the FOIL algorithm
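The path-finding step can be pictured as a search for a short chain of hyperlinks between the pages of two class instances. The sketch below uses a plain breadth-first search over the link graph and treats links as undirected, in line with the earlier assumption that a relation is described by an undirected path of hyperlinks. It is not the procedure used inside FOIL; the graph format, the `max_len` bound, and the name `find_path` are assumptions.

```python
from collections import deque


def find_path(links, start, goal, max_len=3):
    """Return a shortest hyperlink path from start to goal, or None.

    links maps a URL to the list of URLs it links to. Links are followed
    in both directions, and paths longer than max_len links are ignored.
    """
    # Build an undirected adjacency map from the directed hyperlinks.
    neighbors = {}
    for src, targets in links.items():
        for dst in targets:
            neighbors.setdefault(src, set()).add(dst)
            neighbors.setdefault(dst, set()).add(src)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) - 1 >= max_len:
            continue  # do not extend paths beyond the length bound
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

The rule-finding step would then learn which properties of the pages and links along such a path indicate that the relation actually holds.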
26Extract text fields
- Using the SRV algorithm
(Sequence Rules with Validation).
- SRV is a first-order rule learner based on FOIL.
- Input is a set of web pages.
- Output is a set of information extraction rules.
- Works on the HTML source and tries to match a
sequence of tokens.
27Conclusion
- Advantages
- Automatic extraction.
- Good performance.
- Disadvantages
- Needs hand-labeled training samples.
- The algorithms need improvement.
28Unsupervised information extraction
- Soft matching patterns
- Represent patterns by
- Combining lexical tokens with
part-of-speech classes and punctuation.
- Adopting a probabilistic framework that combines
slot content and sequential fidelity in computing
the degree of pattern match.
29Unsupervised information extraction
- Tokens for the soft matching patterns
30Unsupervised information extraction
- Example of generalizing soft patterns
31Unsupervised information extraction
- Pattern matching algorithm
- Cosine similarity.
- Express the soft pattern as follows:
- <Slot_-w, ..., Slot_-2, Slot_-1, SCH_TERM, Slot_1,
Slot_2, ..., Slot_w : Pa>
- <(token_i1, weight_i1), (token_i2, weight_i2), ...,
(token_im, weight_im) : Slot_i>
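As an illustration of the cosine similarity mentioned above, here is a small sketch that compares two sparse token-weight vectors of the kind stored in a slot. The slides do not say exactly which vectors are compared, so representing each slot as a dict from token to weight is an assumption, as is the function name `cosine_similarity`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as token -> weight dicts."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty slot matches nothing
    return dot / (norm_a * norm_b)
```

Two slots with the same relative token weights score 1, and slots that share no tokens score 0.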