Learning to Extract Symbolic Knowledge from the World Wide Web

1
Learning to Extract Symbolic Knowledge from the
World Wide Web
  • Changho Choi
  • Source: http://www.cs.cmu.edu/knigam/
  • Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew
    McCallum
  • Carnegie Mellon University, J. Stefan Institute
  • AAAI-98

2
Abstract
Information on the Web
Understandable to humans
Knowledgeable
????
Extract information
KB

3
Introduction (1/4)
  • Two types of input to the information extraction
    system
  • Ontology: specifies the classes and relations of
    interest
  • For example, a hierarchy of classes including
    Person, Student, Research.Project, Course, etc.
  • Training examples: represent instances of the
    ontology classes and relations
  • For example, a course Web page for the Course
    class, a faculty Web page for the Faculty class,
    and a pair of pages for the Courses.Taught.By
    relation
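
The two inputs can be pictured as plain data structures. The following is a minimal sketch, not the system's actual representation: the class and relation names come from the slide, while the parent links (Student and Faculty under Person), the argument classes of Courses.Taught.By, and the example.edu URLs are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Ontology:
    # class name -> parent class name (None for a root class)
    parents: Dict[str, Optional[str]] = field(default_factory=dict)
    # relation name -> (class of first argument, class of second argument)
    relations: Dict[str, Tuple[str, str]] = field(default_factory=dict)

    def add_class(self, name: str, parent: Optional[str] = None) -> None:
        if parent is not None and parent not in self.parents:
            raise KeyError(f"unknown parent class: {parent}")
        self.parents[name] = parent

    def add_relation(self, name: str, first: str, second: str) -> None:
        for cls in (first, second):
            if cls not in self.parents:
                raise KeyError(f"unknown class: {cls}")
        self.relations[name] = (first, second)

    def is_a(self, name: str, ancestor: str) -> bool:
        """True if `name` equals `ancestor` or descends from it."""
        current: Optional[str] = name
        while current is not None:
            if current == ancestor:
                return True
            current = self.parents[current]
        return False


# Input 1: the ontology (parent links and argument classes are assumed).
ontology = Ontology()
ontology.add_class("Person")
ontology.add_class("Student", parent="Person")
ontology.add_class("Faculty", parent="Person")
ontology.add_class("Course")
ontology.add_class("Research.Project")
ontology.add_relation("Courses.Taught.By", "Course", "Faculty")

# Input 2: training examples (hypothetical URLs).
class_examples: List[Tuple[str, str]] = [
    ("http://example.edu/cs101/", "Course"),
    ("http://example.edu/~smith/", "Faculty"),
]
relation_examples: List[Tuple[str, str, str]] = [
    ("Courses.Taught.By", "http://example.edu/cs101/", "http://example.edu/~smith/"),
]
```

Keeping the ontology separate from the labeled examples mirrors the slide: the ontology says what to look for, and the examples show what instances look like on the Web.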

4
(No Transcript)

5
Introduction (3/4)
  • Assumptions about the mapping between the ontology
    and the Web
  • 1. Each instance of an ontology class is
  • a single Web page,
  • a contiguous string of text,
  • or a collection of several Web pages.
  • 2. Each instance of a relation is
  • a segment of hypertext,
  • a contiguous segment of text,
  • or the hypertext segment.

6
Introduction (4/4)
  • Three primary learning tasks involved in extracting
    knowledge-base instances from the Web
  • 1. Recognizing class instances by classifying
    bodies of hypertext.
  • 2. Recognizing relation instances by classifying
    chains of hyperlinks.
  • 3. Recognizing class and relation instances by
    extracting small fields of text from Web pages.

7
Experimental Testbed
  • Experiments
  • Based on the ontology
  • Classes: department, faculty, staff, student,
    research_project, course, other
  • Relations: Instructors.Of.Course (251),
    Members.Of.Project (392), Department.Of.Person (748)
  • Data sets
  • A set of 4,127 pages and 10,945 hyperlinks from
    four CS departments
  • A set of 4,120 pages from numerous other CS
    departments
  • Evaluation
  • Four-fold cross-validation
  • Three folds for training, one for testing
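
The evaluation scheme can be sketched as follows. This is a minimal illustration that assumes each of the four folds corresponds to one department, which the slide does not state; the department keys and page names are hypothetical toy data.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_group_out(
    pages_by_group: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_group, train_pages, test_pages), one fold per group."""
    groups = sorted(pages_by_group)
    for held_out in groups:
        test = list(pages_by_group[held_out])
        train = [
            page
            for group in groups
            if group != held_out
            for page in pages_by_group[group]
        ]
        yield held_out, train, test


# Hypothetical toy data: four groups, so four folds (3 train, 1 test each).
pages_by_group = {
    "dept_a": ["a1.html", "a2.html"],
    "dept_b": ["b1.html"],
    "dept_c": ["c1.html", "c2.html"],
    "dept_d": ["d1.html"],
}

folds = list(leave_one_group_out(pages_by_group))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Every page is used for testing exactly once and never appears in the training set of its own fold.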

8
Statistical Text Classification
  • Process
  • Building a probabilistic model of each class
    using labeled training data
  • Classifying newly seen pages by selecting the
    class that is most probable given the
    evidence of words describing the new page
  • Train three classifiers
  • Full-text
  • Title/Heading
  • Hyperlink
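
The process above (build a per-class word model, then pick the most probable class) can be sketched with a plain multinomial naive Bayes classifier. This is an illustrative sketch only: the add-one smoothing, the function names, and the toy word lists are assumptions, not details taken from the slides.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """labeled_docs: list of (class_label, list_of_words) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    total_docs = sum(class_counts.values())
    priors = {c: n / total_docs for c, n in class_counts.items()}
    word_probs = {}
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        # Add-one (Laplace) smoothing; the smoothing scheme is an assumption.
        word_probs[c] = {
            w: (word_counts[c][w] + 1) / (total_words + len(vocab)) for w in vocab
        }
    return priors, word_probs


def classify(words, priors, word_probs):
    """Return the class with the highest log posterior for the given words."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for w in words:
            if w in word_probs[c]:  # words outside the vocabulary are ignored
                score += math.log(word_probs[c][w])
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical toy training data.
docs = [
    ("course", ["syllabus", "homework", "exam"]),
    ("course", ["lecture", "homework"]),
    ("student", ["advisor", "resume", "homework"]),
    ("student", ["resume", "research"]),
]
priors, word_probs = train(docs)
print(classify(["homework", "exam"], priors, word_probs))   # course
print(classify(["resume", "advisor"], priors, word_probs))  # student
```

The three classifiers listed on the slide would differ only in which words are passed in: the full page text, the title and heading words, or the words on hyperlinks.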

9
Statistical Text Classification
  • Approach
  • Naïve Bayes, with minor modifications
  • Based on Kullback-Leibler divergence
  • Given a document d to classify, we calculate a
    score for each class c as follows
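
The scoring formula itself is not part of this transcript. The sketch below assumes the score has the form Score_c(d) = log Pr(c) / n + sum over words w of Pr(w|d) * log(Pr(w|c) / Pr(w|d)), where n is the number of words in d, Pr(w|d) is the relative frequency of w in d, and Pr(w|c) is a smoothed per-class word probability. Treat that form, the function name, and the toy probabilities as assumptions rather than as the slide's exact content.

```python
import math
from collections import Counter


def kl_score(doc_words, prior, class_word_probs):
    """Score one class for one document under the assumed formula.

    prior: Pr(c). class_word_probs: smoothed Pr(w|c) for each vocabulary word.
    """
    words = [w for w in doc_words if w in class_word_probs]
    n = len(words)
    if n == 0:
        raise ValueError("document has no in-vocabulary words")
    score = math.log(prior) / n
    # Words absent from the document have Pr(w|d) = 0 and contribute nothing.
    for w, count in Counter(words).items():
        p_w_given_d = count / n
        score += p_w_given_d * math.log(class_word_probs[w] / p_w_given_d)
    return score


# Hypothetical per-class word probabilities.
class_word_probs = {
    "course": {"homework": 0.5, "exam": 0.3, "resume": 0.2},
    "student": {"homework": 0.2, "exam": 0.1, "resume": 0.7},
}
doc = ["homework", "exam", "homework"]
scores = {c: kl_score(doc, 0.5, probs) for c, probs in class_word_probs.items()}
best = max(scores, key=scores.get)
print(best)  # course
```

Under this form the second term is the negative of a Kullback-Leibler divergence between the document's word distribution and the class's, so it is largest when the two match. Dividing the log prior by n and subtracting a term that does not depend on c leaves the ranking of classes the same as in ordinary naive Bayes, while making scores comparable across documents of different lengths.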

10
Statistical Text Classification
  • Experimental evaluation

Predicted \ Actual  course  student  faculty  staff  research_project  department  other  Accuracy (%)
course                 202       17        0      0                 1           0    552          26.2
student                  0      421       14     17                 2           0    519          43.3
faculty                  5       56      118     16                 3           0    264          17.9
staff                    0       15        0      4                 0           0     45           6.2
research_project         8        9       10      5                62           0    384          13.0
department              10        8        3      1                 5           4    209           1.7
other                   19       32        7      3                12           0   1064          93.6
Coverage (%)          82.8     72.4     77.1    8.7              72.9       100.0   35.0

11
Accuracy/coverage
  • Coverage
  • The percentage of pages of a given class that
    are correctly classified as belonging to that
    class
  • Accuracy
  • The percentage of pages classified into a given
    class that are actually members of that class
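
Both measures can be recomputed from the confusion matrix on the experimental evaluation slide. The sketch below reads each row as a predicted class and each column as an actual class, which is the reading under which the course row reproduces its listed accuracy. The counts are copied as transcribed, so a recomputed value may differ from a listed percentage wherever the transcription is imperfect. The function names are illustrative.

```python
CLASSES = [
    "course", "student", "faculty", "staff",
    "research_project", "department", "other",
]

# Rows: predicted class. Columns: actual class. Counts as transcribed.
MATRIX = [
    [202, 17, 0, 0, 1, 0, 552],
    [0, 421, 14, 17, 2, 0, 519],
    [5, 56, 118, 16, 3, 0, 264],
    [0, 15, 0, 4, 0, 0, 45],
    [8, 9, 10, 5, 62, 0, 384],
    [10, 8, 3, 1, 5, 4, 209],
    [19, 32, 7, 3, 12, 0, 1064],
]


def accuracy_pct(cls):
    """Of the pages classified into cls, the percentage that belong to it."""
    i = CLASSES.index(cls)
    return 100.0 * MATRIX[i][i] / sum(MATRIX[i])


def coverage_pct(cls):
    """Of the pages that belong to cls, the percentage classified into it."""
    j = CLASSES.index(cls)
    return 100.0 * MATRIX[j][j] / sum(row[j] for row in MATRIX)


print(round(accuracy_pct("course"), 1))  # 26.2  (202 of 772 predicted)
print(round(coverage_pct("course"), 1))  # 82.8  (202 of 244 actual)
```

For the course class, most actual course pages are found (high coverage), but most pages labeled as course are really in the other class (low accuracy).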

12
Accuracy/coverage tradeoff
1. Full-text classifiers
2. Hyperlink classifiers
3. Title/heading classifiers
Hyperlink information can provide strong evidence
for classifying a page.