Title: Information Extraction and Ontology Learning Guided by Web Directory
1. Information Extraction and Ontology Learning
Guided by Web Directory
- Authors: Martin Kavalec, Vojtech Svátek
- Presenter: Mark Vickers
2. Outline
- Introduction
- Mining Indicator Terms
- Integrating Rainbow
- Ontological Analysis of Web Directories
- IE and Ontology Learning
- Future Work
- Related Work
- Assessment
3. Introduction
- Goal: to extract information about (mostly generic)
products, services and areas of competence of
companies from the free-text chunks embedded in
web presentations.
- Taking advantage of:
  - Collections of extraction patterns
  - Ontologies of problem domains
- Approach: combine information extraction (IE) with
ontologies
  - Ontologies can improve the quality of IE
  - Extracted information can improve/extend
  ontologies
  - Bootstrapping
4. Introduction
- Uses the Open Directory (http://dmoz.org)
  - Obtain labeled training data
  - Lightweight ontologies
The Open Directory Project is the largest, most
comprehensive human-edited directory of the Web.
6. Mining Indicator Terms
- Informative terms: generic names of products
- Indicator terms: situated near informative terms
  - Example: "our assortment includes"
  - "in our shop you can buy"
- Assumption: directory headings coincide with
informative terms
- Purpose: generate extraction patterns based on
indicator terms
- They use deeper linguistic techniques
7. Mining Indicator Terms
- Example: /Manufacturing/Materials/Metals/Steel/
(informative terms)
- Match headings with text pages to find sentences
containing informative terms
- Grab nearby words as indicator terms
- Generate extraction patterns from indicator terms
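The matching step on this slide can be sketched roughly as follows. This is only an illustration, not the authors' implementation: it assumes single-word heading terms and uses a fixed token window where the slides describe a parser-based notion of "nearby". The names `heading_terms`, `candidate_indicators` and `window` are hypothetical.

```python
import re

def heading_terms(path):
    """Split a directory path into lowercase heading terms."""
    return [part.replace("_", " ").lower() for part in path.split("/") if part]

def candidate_indicators(text, terms, window=3):
    """Collect the words within `window` tokens of each informative term.

    Assumption: a plain token window stands in for the parse-tree
    distance used in the presented work.
    """
    found = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, token in enumerate(tokens):
            if token in terms:
                nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                found.append((token, nearby))
    return found

terms = heading_terms("/Manufacturing/Materials/Metals/Steel/")
page = "Our assortment includes stainless steel sheets. Contact us today."
print(candidate_indicators(page, terms))
```

Only the first sentence contains a heading term, so only its neighbouring words become indicator candidates.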
8. Mining Indicator Terms
- Choosing indicator terms
  - Syntactic analysis: Link Grammar Parser
  - Choose the verbs occurring closest in the parse tree
  to the informative word
  - Arrange verbs into a frequency table
  - Order by the ratio of frequency near informative
  terms to frequency in general
  - Choose the 8 most promising verbs
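The ordering by frequency ratio could look like the sketch below. The counts are toy numbers invented for illustration, and the function name `rank_indicators` and the use of relative frequencies are assumptions; the slides only state that verbs are ordered by the ratio of frequency near informative terms to frequency in general.

```python
from collections import Counter

def rank_indicators(near_counts, general_counts, top_n=8):
    """Rank verbs by (relative frequency near informative terms)
    divided by (relative frequency in general text)."""
    total_near = sum(near_counts.values())
    total_general = sum(general_counts.values())
    scored = []
    for verb, near in near_counts.items():
        general = general_counts.get(verb, 0)
        if general == 0:
            continue  # no general-frequency estimate; skip rather than divide by zero
        ratio = (near / total_near) / (general / total_general)
        scored.append((verb, ratio))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

# Toy counts, not data from the presentation.
near = Counter({"offer": 30, "include": 20, "be": 50})
general = Counter({"offer": 40, "include": 60, "be": 900})
for verb, ratio in rank_indicators(near, general):
    print(verb, round(ratio, 2))
```

A very common verb such as "be" occurs often near informative terms but even more often elsewhere, so the ratio pushes it to the bottom of the list.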
9. Mining Indicator Terms
- Preliminary testing
  - Sampled 14,500 sentences containing heading terms
  - Randomly chose 130 sentences with indicators
  - Manually labeled them to estimate whether an
  informative term was present
- Example: "We are equipped to run any grade of corrugated
from E-flute to Triplewall, including all
government grades."
10. Mining Indicator Terms
Coverage
Non-Filtered 10 20
Pre-Filtered 70 80
11. Integration into Rainbow
- RAINBOW (Reusable Architecture for INtelligent Brokering
Of Web information access)
- Web Analysis Tasks
- Sentence Extraction
- Explicit Metadata
- HTML Structure
- Inline Image
- Link Topology Structure
- Page Similarity
- Internal Communication based on SOAP
- Will use ontologies for verifying semantic
consistency of web services provided within the
distributed system
12. Integration into Rainbow
- Rainbow will help solve the coverage problem of
directory links pointing to barren pages
- Using analysis of:
  - Keywords and HTML structure on start-up pages
  - URLs of embedded links
- The Metadata Extractor will be navigated towards
promising pages
  - For example, looking for "about-us" or "profile" pages
  to find more syntactically correct text
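A minimal sketch of the URL-based part of this idea, assuming a hand-written keyword list. The tuple `PROMISING` and both function names are hypothetical, and Rainbow's actual analysis modules are not shown in the slides.

```python
from urllib.parse import urlparse

# Assumed keyword list; the slides only mention "about-us" and "profile".
PROMISING = ("about-us", "about_us", "aboutus", "profile")

def is_promising(url):
    """True if the URL path contains one of the assumed keywords."""
    path = urlparse(url).path.lower()
    return any(keyword in path for keyword in PROMISING)

def promising_links(urls):
    """Keep only the embedded links worth visiting for free text."""
    return [url for url in urls if is_promising(url)]

links = [
    "http://example.com/about-us.html",
    "http://example.com/cart",
    "http://example.com/company/profile",
]
print(promising_links(links))
```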
13Ontological Analysis of Web Directories
-Industries - Construction_and_Maintenance -
Materials_and_supplies - Masonry_and_Stone
- Natural_Stone - International_Sources
- Mexico
- Terms and Phrases in single heading belong to a
small set of classes - Parent-child relations belong to particular
classes corresponding to deep ontological
relations.
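As a starting point for this kind of analysis, a heading chain can be broken into parent-child pairs and each heading into its constituent phrases. The sketch below does only that mechanical step; deciding which class a phrase or relation belongs to is left open, as in the slides. The function names and the "_and_" splitting rule are assumptions.

```python
def parent_child_pairs(headings):
    """Pair every heading with the heading directly below it."""
    return list(zip(headings, headings[1:]))

def heading_phrases(heading):
    """Split a heading such as 'Masonry_and_Stone' into its phrases."""
    return [part.replace("_", " ") for part in heading.split("_and_")]

path = [
    "Industries", "Construction_and_Maintenance", "Materials_and_supplies",
    "Masonry_and_Stone", "Natural_Stone", "International_Sources", "Mexico",
]
for parent, child in parent_child_pairs(path):
    print(heading_phrases(parent), "->", heading_phrases(child))
```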
14. Ontological Analysis of Web Directories
- Meta-ontology of directory headings (diagram labels:
Class, Class-subclass Relations, Named Relations,
Reflexive Binary Relations)
15. Ontological Analysis of Web Directories
16. IE and Ontology Learning
- Extracting with plain indicator terms and simple
heuristics works
- But even better:
  - Learn indicators for each class
  - Use ontology analysis to classify the indicators
  found
  - Fill in database templates: true IE
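To make the template-filling idea concrete, here is a rough sketch. The class names, the assignment of indicator phrases to classes, and the rule "the slot filler is whatever follows the indicator" are all assumptions made for illustration; the slides do not specify them.

```python
# Hypothetical class-specific indicators. The two "product" phrases are the
# examples from the earlier slide; the "service" phrase is invented.
INDICATORS = {
    "product": ["our assortment includes", "in our shop you can buy"],
    "service": ["we provide"],
}

def fill_template(sentence):
    """Return {class: [fillers]} for every indicator found in the sentence."""
    template = {}
    lowered = sentence.lower()
    for cls, phrases in INDICATORS.items():
        for phrase in phrases:
            pos = lowered.find(phrase)
            if pos == -1:
                continue
            # Assumed heuristic: take the rest of the sentence as the filler.
            filler = sentence[pos + len(phrase):].strip(" .")
            if filler:
                template.setdefault(cls, []).append(filler)
    return template

print(fill_template("Our assortment includes garden tools."))
```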
17. IE and Ontology Learning
- Closed-loop strategy (diagram labels):
  - Learn class-specific indicators
  - Classify headings
  - Human classifies directory headings (WordNet)
18. Future Work
- Complete the information extraction / ontology
learning loop.
- With relation to the Semantic Web, they want to adapt
the technique to the standards of usual explicit
metadata
  - Example: the information extracted can be forged
  into RDF triples, with indicator collections
  accessible over the web
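One simple way to picture extracted information as RDF triples is to serialize it as N-Triples lines. This is a sketch under assumptions: the company URI and the `offers` predicate are placeholders, and the slides name no vocabulary.

```python
def to_ntriples(subject_uri, predicate_uri, values):
    """Serialize one subject/predicate with several literal objects as N-Triples."""
    lines = []
    for value in values:
        # Escape backslashes and double quotes inside the literal.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject_uri}> <{predicate_uri}> "{escaped}" .')
    return "\n".join(lines)

# Placeholder URIs, not a real vocabulary.
print(to_ntriples(
    "http://example.com/acme",
    "http://example.org/vocab#offers",
    ["steel sheets", "garden tools"],
))
```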
19. Related Work
- Combining IE and ontologies (without use of web
directories)
  - "Bootstrapping an Ontology-Based Information
  Extraction System"
- Advantages of using Link Grammar Parser
  - "Learning to Generate Semantic Annotation for
  Domain Specific Sentences"
- Using Yahoo to classify whole documents
  - "Turning Yahoo into an Automatic Web-Page
  Classifier"
- Similar work aimed at more structured information,
using search engines
  - "Extracting Patterns and Relations from the World
  Wide Web"
- Bootstrapping and other statistical methods for
IE
  - "Text Classification by Bootstrapping with
  Keywords"
  - "Learning Dictionaries for Information Extraction
  by Multi-Level Bootstrapping"
20. Assessment
- I don't think indicator term learning is done
(even though they say it is)
- Counts on not-yet-decided ontology learning
techniques
- Need to develop an official directory