Web Information Extraction using a Search Engine presentation

1
Web Information Extraction using a Search Engine
Gijs Geleijnse
2
Outline

Introduction
Turing
Beyond Turing

3
Web Information Extraction

Information Extraction is the task of identifying
entities and their relations in natural language
texts.
Information Extraction is easy when we UNDERSTAND
documents.
Document Understanding is AI-complete:
computers need to be made as intelligent as
people.

4
Information Extraction

Gijs eats his wok dish with chopsticks,
while Dragan has his soup with vinegar.
We need grammar and precise semantics, and we must
deal with ambiguities, make interpretations, read
between the lines, etc.

5
Information Extraction

Gijs eats his wok dish with chopsticks,
while Natasa has his soup with vinegar.
We need grammar and precise semantics, and we must
deal with ambiguities, make interpretations, read
between the lines, etc.

6
Why Web Information Extraction?

To be really intelligent, applications need to be
world wise.
Hence, they need interpretable information
Given some information demand, the web seems a
good place to look
We are interested in structured, machine
interpretable information

7
Web Information

Why use unstructured texts?
- Get opinions from social websites
- Get facts from Wikipedia
Web Information Extraction is valuable when
- we create knowledge that cannot be extracted
  from e.g. Wikipedia or Last.fm

8
Web Information Extraction

Find relevant texts
Identify entities
Identify relations
Problems
- No consistent newspaper-like texts
- No training sets
- Various languages
- Typos, jokes, rubbish
Solution
- Keep it simple
- Explore redundancy

9
Redundancy

Redundancy of information does the trick
We can assume that important concepts are
mentioned on multiple pages
Because of the expected redundancy we do not have
to recognize each occurrence
Simple techniques just might work

10
Historical persons
Assume given biographical characterizations
11
Pattern-based approach

Concept has been successfully applied in various
studies
Simple and effective
Precise queries lead to highly relevant snippets
- query: "Amsterdam is the capital of"
We get the relation for free
- "Amsterdam is the capital of the Netherlands"
- the Netherlands is a country
- the Netherlands has Amsterdam as its capital
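
The pattern-based idea above can be sketched in a few lines of Python. This is only an illustration, assuming the search-engine snippets have already been retrieved; the function name `complete_pattern` and the sample `snippets` are hypothetical and not part of the presentation.

```python
import re
from collections import Counter

def complete_pattern(pattern, snippets):
    """Count the capitalized phrases that directly follow `pattern`."""
    # Optional article, then one or more capitalized words.
    tail = r"\s+(?:the\s+)?((?:[A-Z][\w-]*)(?:\s+[A-Z][\w-]*)*)"
    regex = re.compile(re.escape(pattern) + tail)
    counts = Counter()
    for snippet in snippets:
        for match in regex.finditer(snippet):
            counts[match.group(1)] += 1
    return counts

# Hypothetical snippets, standing in for search-engine results.
snippets = [
    "Amsterdam is the capital of the Netherlands, but not the seat of government.",
    "Did you know Amsterdam is the capital of the Netherlands?",
    "amsterdam is the capital of fun",
]
counts = complete_pattern("Amsterdam is the capital of", snippets)
best, hits = counts.most_common(1)[0]
```

Because the same fact is expected on many pages, the most frequent completion is a reasonable candidate; occurrences the simple pattern misses do not matter much.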

12
Historical persons
Querying with genders is not the best choice
13
Query periods
person ( period ) is the best pattern.
14
1912 - 1954
person was
15
Beyond Turing

How to identify entities?
How to find relation patterns?
How to process this data?

16
Identifying Entities/Instances/Terms

Basically two alternatives
- Set of rules
  - requires some intelligence
- Machine learning
  - requires a training set

17
Learning Patterns

Queries need to give many useful results
A pattern needs to
- be precise (to give high quality results)
- occur frequently (to get (m)any results)
- be not too narrow (not only applicable to one pair)
Use of a training set of instance pairs to learn
the patterns.
We want to perform as few queries as possible to
learn the patterns.
Bootstrapping: learned relations can be used to
learn patterns.

18
Use Instances found in queries
19
Use Instances found in queries
starring Marlon Brando
20
Retrieving patterns
We formulate queries with the elements in the
training set:
- allintext: Michael Jackson Thriller
- allintext: Thriller Michael Jackson
We retrieve all inner-sentence fragments
between the instances and normalize them (remove
punctuation marks and capitals).
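
A rough sketch of this retrieval step, assuming the snippets are already available as strings. The helper names `normalize` and `inner_fragments`, the placeholder words "artist" and "album", and the sample snippets are illustrative assumptions.

```python
import re
import string

def normalize(fragment):
    """Remove punctuation marks and capitals, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(fragment.translate(table).lower().split())

def inner_fragments(artist, album, snippets):
    """Return normalized patterns found between the two instances."""
    a, b = re.escape(artist), re.escape(album)
    # [^.!?]*? keeps the fragment inside a single sentence.
    forward = re.compile(a + r"([^.!?]*?)" + b)
    backward = re.compile(b + r"([^.!?]*?)" + a)
    patterns = []
    for snippet in snippets:
        for m in forward.finditer(snippet):
            patterns.append(normalize("artist" + m.group(1) + "album"))
        for m in backward.finditer(snippet):
            patterns.append(normalize("album" + m.group(1) + "artist"))
    return patterns

# Hypothetical snippets for the training pair (Michael Jackson, Thriller).
snippets = [
    "Thriller, by Michael Jackson, was released in 1982.",
    "Michael Jackson's album Thriller sold millions. Thriller is great.",
]
patterns = inner_fragments("Michael Jackson", "Thriller", snippets)
```

Replacing the instances by placeholders before normalizing turns each fragment into a reusable pattern that can later be queried with other instances.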
21
Retrieving relation patterns
We now have a (long) list of patterns:
- album by artist
- artists album album
- album cover by artist
- album di artist
- artist-cd album
- ...
Now to compute scores: frequency, precision, wide-spreadness
22
Retrieving relation patterns
Frequency: we take the frequency of the pattern
in the list obtained.
Precision:
- we google the pattern in combination with an
  instance
- observe the fraction of useful results
e.g. if we google "ABBA's new album", we divide
the number of excerpts with an album title by the
total number of excerpts found
23
Evaluate relation patterns
Wide-spreadness: we count the number of different
instances found with the query.
Score = freq · prec · spr
We only compute the scores of the N most frequent
patterns.
Number of queries: 2 · |training set| + N · |instance set|
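
The scoring step could look roughly as follows. This sketch assumes the overall score is the product of the three factors and that precision and wide-spreadness have already been measured for each pattern; all numbers and names below are made up for illustration.

```python
from collections import Counter

def score_patterns(pattern_list, precision, spread, n):
    """Score the n most frequent patterns as freq * prec * spr (assumed)."""
    freq = Counter(pattern_list)
    scores = {}
    for pattern, f in freq.most_common(n):
        scores[pattern] = f * precision[pattern] * spread[pattern]
    return scores

def query_count(training_pairs, n, instances):
    """Two queries per training pair, plus n patterns times the instance set."""
    return 2 * training_pairs + n * instances

# Illustrative values only, not results from the presentation.
pattern_list = (["album by artist"] * 3
                + ["album di artist"] * 2
                + ["artistcd album"])
precision = {"album by artist": 0.5, "album di artist": 0.25}
spread = {"album by artist": 4, "album di artist": 2}

scores = score_patterns(pattern_list, precision, spread, n=2)
total_queries = query_count(training_pairs=10, n=2, instances=50)
```

Restricting the evaluation to the N most frequent patterns keeps the number of queries bounded, since precision and wide-spreadness both cost queries to measure.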
24
Experiment hyponym patterns
Are Hearst patterns [Hea92] indeed the most
effective patterns for the is-a relation?
O = ((country, hyponym), (all countries,
country, countries), is_a,
(Afghanistan, country), (Afghanistan,
countries), (Akrotiri, country),
(Akrotiri, countries), ...)
25
Case-study Hearst Patterns

Both the common Hearst Patterns and relations
typical for this setting (countries) perform
well. Method to select the most effective
hyponym patterns.
26
Case-study Burger King
TREC QA question: In which countries can Burger
King be found?
O = ((country, restaurant), (all countries,
McDonald's, KFC), located_in,
(McDonald's, USA), (KFC, China), ...)
27
Case-study Burger King
We first find patterns using the method
described
28
How to recognize a hamburger chain?

We know where to look for them
- Capitalized words
- The query "restaurants like X and" gives enough
  hits.
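
One simple way to turn this rule into code is a regular expression that accepts only capitalized words in the X position. The pattern text follows the slide; the regular expression, the function name `candidates` and the sample snippets are assumptions made for this sketch.

```python
import re

# One or more capitalized words, allowing apostrophes, ampersands and hyphens.
CAPITALIZED = r"[A-Z][\w'&-]*(?:\s+[A-Z][\w'&-]*)*"

def candidates(snippets):
    """Collect capitalized phrases X from 'restaurants like X and'."""
    regex = re.compile(r"restaurants like (" + CAPITALIZED + r") and\b")
    found = []
    for snippet in snippets:
        found.extend(regex.findall(snippet))
    return found

# Hypothetical snippets, standing in for search-engine results.
snippets = [
    "Fast food restaurants like Burger King and KFC are everywhere.",
    "I avoid restaurants like that and eat at home.",
    "restaurants like McDonald's and Wendy's",
]
names = candidates(snippets)
```

The second snippet is rejected because "that" is not capitalized, which is exactly the kind of noise the capitalization rule is meant to filter out.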

29
Instances learned
Learned while evaluating the patterns identified

Restaurants
Cuisines
Stopwords
Geography

30
Case-study Burger King

Finally, we use the patterns found in combination
with Burger King to find relations.
- Precision: 80%
- Recall: 85%
Most errors due to countries in which Burger King
plans to open restaurants.

31
Processing Data

Okay, we have now presented a pattern-based method
to extract information from the web
How to use the method to identify new
information?
If you know the instances, do alternatives exist?

32
(No Transcript)
33
[Figure labels: hard rock, folk, country, camp, classical, pop, rap, soul]
34
Extracting Community-based meta-data from the Web

We combine evidence from multiple web sources
Does the Web community agree with Last.fm
community?

35
Overview

Problem description
Three alternative methods to find co-occurrences
Using instance similarities to improve mapping
Experimental results tagging music artist with
genres and painters with art movement
Conclusions

36
Two Problems

A: a set of instances (artists/painters/...)
L: a set of labels/tags/genres/...
1. Find the most appropriate mapping m(a) ∈ L for
each instance a ∈ A.
2. Find a score t(a,b) expressing the relatedness
of each pair (a,b) ∈ A × A.

37
Using co-occurrences for the mapping

We use the Web to find co-occurrences of artists
and labels.
If two terms co-occur relatively often in the
same context, we can consider them to be related
(a) How to find co-occurrences / which context to
choose?
(b) How to use them to find a mapping m?

38
Finding artist similarities

Three alternatives to find web co-occurrences
between a and l, using
- the number of hits
- patterns
- full documents

39
Co-occurrences using Google hits
co(abba, disco) = 2,780,000
co(abba, hard rock) = 625,000
Use of additional terms can specify the query.
Google Complexity: |A| × |L| queries.
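
A minimal sketch of the hit-count approach, using the two counts from the slide. Choosing the label with the most hits is an assumption made here for illustration, and `hit_count` is a hypothetical stand-in for a search-engine call.

```python
def map_by_hits(artists, labels, hit_count):
    """Map each artist to the label with the most hits; count the queries."""
    mapping = {}
    queries = 0
    for artist in artists:
        best_label, best_hits = None, -1
        for label in labels:
            hits = hit_count(artist, label)  # one query per (artist, label)
            queries += 1
            if hits > best_hits:
                best_label, best_hits = label, hits
        mapping[artist] = best_label
    return mapping, queries

# Hit counts as given on the slide.
slide_counts = {
    ("abba", "disco"): 2_780_000,
    ("abba", "hard rock"): 625_000,
}
mapping, queries = map_by_hits(
    ["abba"], ["disco", "hard rock"],
    lambda artist, label: slide_counts[(artist, label)],
)
```

The nested loop makes the cost visible: every artist is combined with every label, so the number of queries grows with the product of the two set sizes.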
40
Co-occurrences using patterns
Take a text fragment that expresses the relation
between a label and an artist:
- "[label] artists such as [artist]"
take a label or an artist:
- "country artists such as"
- "artists such as Johnny Cash"
and Google!
41
Co-occurrences using patterns

- The relation can be specified
- Search for terms in the excerpts
- only |A| + |L| queries
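
Counting co-occurrences from the returned excerpts might be sketched like this, with one query per label (or per artist). The function name `count_cooccurrences` and the sample excerpts are hypothetical; a real system would fetch the excerpts from the search engine.

```python
from collections import Counter

def count_cooccurrences(excerpts_by_term, candidates):
    """Count how often each candidate term appears in a queried term's excerpts."""
    co = Counter()
    for queried_term, excerpts in excerpts_by_term.items():
        for excerpt in excerpts:
            text = excerpt.lower()
            for candidate in candidates:
                if candidate.lower() in text:
                    co[(queried_term, candidate)] += 1
    return co

# Hypothetical excerpts for the query "country artists such as".
excerpts_by_label = {
    "country": [
        "country artists such as Johnny Cash and Dolly Parton",
        "Country artists such as Johnny Cash are timeless.",
    ],
}
artists = ["Johnny Cash", "Dolly Parton", "ABBA"]
co = count_cooccurrences(excerpts_by_label, artists)
```

A single query per label yields evidence for many artists at once, which is why far fewer queries are needed than with one query per pair.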

42
Co-occurrences using full documents

- Google considers these pages very relevant to
  the query (Britney Spears)
- The tags on these pages will probably reflect
  her.
- again only O(|A| + |L|) queries

43
Query: country music
44
Using these co-occurrences
Using relative frequencies
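
The slide does not spell out the formula, so the following is only one plausible reading: divide each count co(a, l) by the total count of label l over all artists, so that very common labels do not win everywhere. The function name and all counts are invented for illustration.

```python
def map_by_relative_frequency(co, artists, labels):
    """Map each artist to the label with the highest relative frequency (assumed)."""
    totals = {l: sum(co.get((a, l), 0) for a in artists) for l in labels}

    def relative(artist, label):
        total = totals[label]
        return co.get((artist, label), 0) / total if total else 0.0

    return {a: max(labels, key=lambda l: relative(a, l)) for a in artists}

# Invented co-occurrence counts; "pop" dominates the raw counts.
co = {
    ("abba", "pop"): 900, ("abba", "disco"): 300, ("abba", "metal"): 10,
    ("metallica", "pop"): 600, ("metallica", "disco"): 20,
    ("metallica", "metal"): 500,
}
mapping = map_by_relative_frequency(
    co, ["abba", "metallica"], ["pop", "disco", "metal"]
)
```

With raw counts both artists would be mapped to "pop"; after normalization the more specific labels come out on top.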
45
Same trick to find similar artists

Same three approaches to find co-occurrences
between the artists
- the number of hits → |A| · (|A| + |L|) queries
- patterns → O(|A| + |L|) queries
- full documents → |A| + |L| queries

46
Using artist similarity to improve mapping

We can use the computed relatedness between
artists to improve the classification.

country
country
country
47
The final mapping
We take an artist and its k nearest neighbors, and
do a majority vote among their initial mappings m(a).
country
country
country
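
The voting step can be sketched as below, assuming an initial mapping and a relatedness score t(a, b) are already available. Counting the artist's own label as one vote and breaking ties in its favor are assumptions of this sketch, and the labels and similarity values are invented.

```python
from collections import Counter

def final_mapping(initial, similarity, k):
    """Majority vote over an artist's own label and its k nearest neighbors."""
    final = {}
    for artist in initial:
        others = [b for b in initial if b != artist]
        neighbors = sorted(
            others, key=lambda b: similarity(artist, b), reverse=True
        )[:k]
        # The artist's own label is counted first, so it wins ties.
        votes = Counter(initial[b] for b in [artist] + neighbors)
        final[artist] = votes.most_common(1)[0][0]
    return final

# Invented initial labels and similarities, for illustration only.
initial = {
    "Johnny Cash": "country", "Willie Nelson": "country",
    "Dolly Parton": "pop", "ABBA": "pop", "Madonna": "pop",
}
pair_scores = {
    frozenset(["Johnny Cash", "Willie Nelson"]): 0.9,
    frozenset(["Johnny Cash", "Dolly Parton"]): 0.8,
    frozenset(["Willie Nelson", "Dolly Parton"]): 0.7,
    frozenset(["ABBA", "Madonna"]): 0.9,
}

def similarity(a, b):
    return pair_scores.get(frozenset([a, b]), 0.1)

result = final_mapping(initial, similarity, k=2)
```

In this toy example the initial label of Dolly Parton is overruled by her two nearest neighbors, while the other artists keep their labels.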
48
Experiments

Which of the three methods performs best?
Does the use of artist similarity improve the
mapping? ... Or how to choose k?
We only evaluate precision (if we find nothing,
it's wrong)

49
Classifying artists into Genres JKU dataset

224 artists divided over 14 genres
publicly available dataset
previous work on this data set focused on
clustering the artists (e.g. Knees et al. 2004,
Schedl et al. 2005)

50
Genres JKU dataset
Documents Patterns Google hits
51
Comparing with Last.fm
52
Classifying painters into movements

Experiment conducted on
- 1,280 painters (en.wikipedia.org/List_of_painters)
- 77 art movements (List_of_art_movements)
Evaluation
- We visited the pages describing the art
  movements.
- Ground truth: painters mentioned on one of these
  pages.
- Leads to a set of 160 painter-movement pairs.

53
Classifying painters into movements
54
Classifying painters into movements
55
Last slide: Conclusions

Presented a method to gather community-based
meta-data
Good methods require low Google Complexity
Experimental results are encouraging
Web Information Extraction is fun
Papers available via http://gijsg.dse.nl

56
(No Transcript)
