Title: Learning to Extract Symbolic Knowledge from the World Wide Web
1 Learning to Extract Symbolic Knowledge from the World Wide Web
- Changho Choi
- Source: http://www.cs.cmu.edu/knigam/
- Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum
- Carnegie Mellon University, J. Stefan Institute
- AAAI-98
2 Abstract
Information on the Web
Understandable to humans
Knowledgeable
????
Extract information
KB
3 Introduction (1/4)
- Two types of inputs to the information extraction system
- Ontology
- Specifies the classes and relations of interest
- For example, a hierarchy of classes including Person, Student, Research.Project, Course, etc.
- Training examples
- Represent instances of the ontology classes and relations
- For example, a course web page for the Course class, faculty web pages for the Faculty class, a pair of pages for the Courses.Taught.By relation, etc.
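The two inputs can be pictured as plain data structures. The following is a minimal sketch, not the authors' representation: the `Ontology` class, the parent/child links, the domain and range given for Courses.Taught.By, and the example URLs are all assumptions made for illustration. Only the class and relation names come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # Maps each class name to its parent class (None for a root class).
    parents: dict = field(default_factory=dict)
    # Maps each relation name to an assumed (domain class, range class) pair.
    relations: dict = field(default_factory=dict)

    def is_a(self, cls, ancestor):
        """Return True if `cls` equals `ancestor` or descends from it."""
        while cls is not None:
            if cls == ancestor:
                return True
            cls = self.parents.get(cls)
        return False


# Hypothetical hierarchy; the parent links are assumptions.
ontology = Ontology(
    parents={
        "Person": None,
        "Student": "Person",
        "Faculty": "Person",
        "Research.Project": None,
        "Course": None,
    },
    relations={"Courses.Taught.By": ("Course", "Faculty")},
)

# Training examples: labeled pages and labeled page pairs (made-up URLs).
class_examples = [
    ("http://example.edu/cs101/", "Course"),
    ("http://example.edu/~smith/", "Faculty"),
]
relation_examples = [
    ("Courses.Taught.By", "http://example.edu/cs101/", "http://example.edu/~smith/"),
]

print(ontology.is_a("Student", "Person"))
```

Keeping class examples and relation examples in separate lists mirrors the slide's distinction between single pages that instantiate a class and pairs of pages that instantiate a relation.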
5 Introduction (3/4)
- Assumptions about the mapping between the ontology and the Web
- 1. Each instance of an ontology class is
- a single Web page,
- a contiguous string of text,
- or a collection of several Web pages.
- 2. Each instance of a relation is
- a segment of hypertext,
- a contiguous segment of text,
- or the hypertext segment.
6 Introduction (4/4)
- Three primary learning tasks involved in extracting knowledge-base instances from the Web
- 1. Recognizing class instances by classifying bodies of hypertext.
- 2. Recognizing relation instances by classifying chains of hyperlinks.
- 3. Recognizing class and relation instances by extracting small fields of text from Web pages.
7 Experimental Testbed
- Experiments
- Based on the ontology
- Classes: department, faculty, staff, student, research_project, course, other
- Relations: Instructors.Of.Course (251), Members.Of.Project (392), Department.Of.Person (748)
- Data sets
- A set of pages (4127) and hyperlinks (10945) from 4 CS departments
- A set of pages (4120) from numerous other CS departments
- Evaluation
- Four-fold cross validation
- 3 for training, 1 for testing
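The evaluation scheme can be sketched as a leave-one-group-out split. This is a minimal sketch under the assumption that each of the four folds corresponds to one of the four CS departments; the department names and page identifiers below are placeholders, not data from the testbed.

```python
def leave_one_group_out(pages_by_group):
    """Yield (held_out, train_pages, test_pages), holding out one group per fold."""
    for held_out in sorted(pages_by_group):
        train = [
            page
            for group, pages in sorted(pages_by_group.items())
            if group != held_out
            for page in pages
        ]
        yield held_out, train, list(pages_by_group[held_out])


# Placeholder department names and page ids, for illustration only.
pages_by_dept = {
    "dept_a": ["a1", "a2"],
    "dept_b": ["b1"],
    "dept_c": ["c1", "c2", "c3"],
    "dept_d": ["d1"],
}

folds = list(leave_one_group_out(pages_by_dept))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Splitting by department rather than by page keeps all pages of the test department unseen during training, which is the point of using 3 folds for training and 1 for testing.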
8 Statistical Text Classification
- Process
- Building a probabilistic model of each class using labeled training data
- Classifying newly seen pages by selecting the class that is most probable given the evidence of words describing the new page
- Train three classifiers
- Full-text
- Title/Heading
- Hyperlink
9 Statistical Text Classification
- Approach
- Naïve Bayes, with minor modifications
- Based on Kullback-Leibler divergence
- Given a document d to classify, a score is calculated for each class c
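The scoring rule itself is not spelled out in the text. The following is a minimal sketch, assuming the score is a length-normalized log prior plus the negative Kullback-Leibler divergence between the document's word distribution and the class's word distribution, that is, log Pr(c)/n + sum over words w of Pr(w|d) log(Pr(w|c)/Pr(w|d)). The Laplace smoothing, the function names, and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def train(docs_by_class, vocab):
    """Estimate log Pr(c) and Laplace-smoothed Pr(w|c) from tokenized documents."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc if w in vocab)
        denom = sum(counts.values()) + len(vocab)
        word_probs = {w: (counts[w] + 1) / denom for w in vocab}
        model[c] = (math.log(len(docs) / total_docs), word_probs)
    return model


def score(model, c, doc):
    """Assumed form: log Pr(c)/n minus KL(Pr(w|d) || Pr(w|c))."""
    log_prior, word_probs = model[c]
    counts = Counter(w for w in doc if w in word_probs)
    n = sum(counts.values())
    if n == 0:
        return log_prior  # no in-vocabulary evidence: fall back to the prior
    total = log_prior / n
    # Words absent from the document have Pr(w|d) = 0 and contribute nothing.
    for w, k in counts.items():
        p_wd = k / n
        total += p_wd * math.log(word_probs[w] / p_wd)
    return total


def classify(model, doc):
    """Pick the class with the highest score."""
    return max(model, key=lambda c: score(model, c, doc))


# Toy training data, made up for illustration.
docs_by_class = {
    "course": [["homework", "lecture", "exam"], ["lecture", "syllabus"]],
    "student": [["advisor", "thesis"], ["thesis", "research", "advisor"]],
}
vocab = {w for docs in docs_by_class.values() for doc in docs for w in doc}
model = train(docs_by_class, vocab)
print(classify(model, ["lecture", "exam", "exam"]))
```

Dividing the log prior by the document length n keeps the prior on the same scale as the per-word divergence term, so long documents are not scored on a different scale than short ones.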
10 Statistical Text Classification
Rows: predicted class. Columns: actual class. Accuracy and coverage are percentages.
Predicted \ Actual    course  student  faculty  staff  Research_project  department  other  Accuracy
Course                   202       17        0      0                 1           0    552      26.2
Student                    0      421       14     17                 2           0    519      43.3
Faculty                    5       56      118     16                 3           0    264      17.9
Staff                      0       15        0      4                 0           0     45       6.2
Research_project           8        9       10      5                62           0    384      13.0
Department                10        8        3      1                 5           4    209       1.7
Other                     19       32        7      3                12           0   1064      93.6
Coverage                82.8     72.4     77.1    8.7              72.9       100.0   35.0
11 Accuracy/coverage
- Coverage
- The percentage of pages for a given class that are correctly classified as belonging to the class
- Accuracy
- The percentage of pages classified into a given class that are actually members of that class
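Both measures can be computed directly from a confusion matrix. The sketch below uses the counts from the table on slide 10, assuming rows are predicted classes and columns are actual classes (an orientation inferred from the printed accuracy and coverage values). The function and variable names are illustrative.

```python
CLASSES = ["course", "student", "faculty", "staff",
           "research_project", "department", "other"]

# Assumed orientation: rows are predicted classes, columns are actual classes.
CONFUSION = [
    [202, 17, 0, 0, 1, 0, 552],
    [0, 421, 14, 17, 2, 0, 519],
    [5, 56, 118, 16, 3, 0, 264],
    [0, 15, 0, 4, 0, 0, 45],
    [8, 9, 10, 5, 62, 0, 384],
    [10, 8, 3, 1, 5, 4, 209],
    [19, 32, 7, 3, 12, 0, 1064],
]


def accuracy(matrix, i):
    """Percent of pages predicted as class i that truly belong to class i."""
    return 100.0 * matrix[i][i] / sum(matrix[i])


def coverage(matrix, i):
    """Percent of pages truly in class i that were predicted as class i."""
    return 100.0 * matrix[i][i] / sum(row[i] for row in matrix)


course = CLASSES.index("course")
print(round(accuracy(CONFUSION, course), 1), round(coverage(CONFUSION, course), 1))
```

For the course class this reproduces the 26.2 accuracy and 82.8 coverage reported in the table. Some of the other printed percentages are not reproduced exactly from the counts as transcribed here, so the matrix is best treated as illustrative.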
12 Accuracy/coverage tradeoff
1. Full-text classifiers
2. Hyperlink classifiers
3. Title/heading classifiers
Hyperlink information can provide strong evidence about a page's class.
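One common way to trace such a tradeoff is to keep only the predictions whose confidence exceeds a threshold and to recompute accuracy and coverage as the threshold rises. The sketch below assumes that mechanism; the slides do not state how the curves were produced, and the predictions, confidences, and function name are made up for illustration.

```python
def tradeoff_curve(predictions, target, thresholds):
    """Return (threshold, accuracy, coverage) for one class at each threshold.

    predictions: list of (predicted_class, confidence, actual_class) tuples.
    """
    n_actual = sum(1 for _, _, actual in predictions if actual == target)
    curve = []
    for t in thresholds:
        accepted = [actual for pred, conf, actual in predictions
                    if pred == target and conf >= t]
        correct = sum(1 for actual in accepted if actual == target)
        # Convention assumed here: accuracy is 1.0 when nothing is accepted.
        acc = correct / len(accepted) if accepted else 1.0
        cov = correct / n_actual if n_actual else 0.0
        curve.append((t, acc, cov))
    return curve


# Made-up predictions, for illustration only.
preds = [
    ("course", 0.9, "course"),
    ("course", 0.8, "course"),
    ("course", 0.6, "other"),
    ("course", 0.4, "course"),
    ("course", 0.3, "other"),
    ("other", 0.7, "course"),
]
curve = tradeoff_curve(preds, "course", [0.0, 0.5, 0.85])
for t, acc, cov in curve:
    print(f"threshold={t:.2f} accuracy={acc:.2f} coverage={cov:.2f}")
```

In this toy data, raising the threshold from 0.0 to 0.85 moves accuracy from 0.60 to 1.00 while coverage falls from 0.75 to 0.25, which is the shape of tradeoff the slide title refers to.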