Title: Web Information Extraction using a Search Engine


1
Web Information Extraction using a Search Engine
Gijs Geleijnse
2
Outline
  • Introduction
  • Turing
  • Beyond Turing

3
Web Information Extraction
  • Information Extraction is the task of identifying
    entities and their relations in natural language
    texts.
  • Information Extraction is easy when we UNDERSTAND
    documents.
  • Document Understanding is AI-complete:
    computers need to be made as intelligent as
    people.

4
Information Extraction
  • Gijs eats his wok dish with chopsticks,
  • while Dragan has his soup with vinegar.
  • We need grammar and precise semantics, and we
    must deal with ambiguities, make interpretations,
    read between the lines, etc.

5
Information Extraction
  • Gijs eats his wok dish with chopsticks,
  • while Natasa has his soup with vinegar.
  • We need grammar and precise semantics, and we
    must deal with ambiguities, make interpretations,
    read between the lines, etc.

6
Why Web Information Extraction?
  • To be really intelligent, applications need to be
    world-wise.
  • Hence, they need interpretable information
  • Given some information demand, the web seems a
    good place to look
  • We are interested in structured, machine
    interpretable information

7
Web Information
  • Why use unstructured texts?
  • Get opinions from social websites
  • Get facts from Wikipedia
  • Web Information Extraction is valuable when we
    create knowledge that cannot be extracted from,
    e.g., Wikipedia or Last.fm

8
Web Information Extraction
  • Find relevant texts
  • Identify entities
  • Identify relations
  • Problems
  • - No consistent newspaper-like texts
  • - No training sets
  • - Various languages
  • - Typos, jokes, rubbish
  • Solution
  • - Keep it simple
  • - Explore redundancy

9
Redundancy
  • Redundancy of information does the trick
  • We can assume that important concepts are
    mentioned on multiple pages
  • Because of the expected redundancy we do not have
    to recognize each occurrence
  • Simple techniques just might work

10
Historical persons
Assume given biographical characterizations
11
Pattern-based approach
  • Concept has been successfully applied in various
    studies
  • Simple and effective
  • Precise queries lead to highly relevant snippets
  • - "Amsterdam is the capital of"
  • We get the relation for free
  • - "Amsterdam is the capital of the Netherlands"
  • - the Netherlands is a country
  • - it has Amsterdam as its capital
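
A minimal sketch of how such a pattern query can yield relations, assuming the snippets have already been returned by a search engine. The function name `extract_objects`, the regular expression, and the sample snippets are illustrative assumptions, not taken from the talk.

```python
import re

def extract_objects(snippets, pattern_prefix):
    """Return the capitalized phrase that follows pattern_prefix in each snippet."""
    # After the pattern: an optional "the", then one or more capitalized words.
    regex = re.compile(
        re.escape(pattern_prefix)
        + r"\s+(?:the\s+)?((?:[A-Z][\w'-]*)(?:\s+[A-Z][\w'-]*)*)"
    )
    found = []
    for snippet in snippets:
        match = regex.search(snippet)
        if match:
            found.append(match.group(1))
    return found

# Hypothetical snippets, invented for illustration.
snippets = [
    "Amsterdam is the capital of the Netherlands, but the government sits elsewhere.",
    "Did you know that Amsterdam is the capital of the Netherlands?",
    "Amsterdam is the capital of fun.",
]
print(extract_objects(snippets, "Amsterdam is the capital of"))
```

The third snippet yields nothing, which is acceptable here: the approach relies on the answer being repeated in other snippets.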

12
Historical persons
Querying with genders is not the best choice
13
Query periods
person (period) is the best pattern.
14
1912 - 1954
person was
15
Beyond Turing
  • How to identify entities?
  • How to find relation patterns?
  • How to process this data?

16
Identifying Entities/Instances/Terms
  • Basically two alternatives
  • - Set of Rules
  • - requires some intelligence
  • - Machine Learning
  • - requires training set

17
Learning Patterns
  • Queries need to give many useful results
  • A pattern needs to
  • - be precise (to give high-quality results)
  • - occur frequently (to get (m)any results)
  • - not be too narrow (not applicable to only one pair)
  • We use a training set of instance pairs to learn
    the patterns.
  • We want to perform as few queries as possible to
    learn the patterns.
  • Bootstrapping: learned relations can be used to
    learn patterns.

18
Use Instances found in queries
19
Use Instances found in queries
starring Marlon Brando
20
Retrieving patterns
We formulate queries with the elements in the
training set:
  • allintext: Michael Jackson Thriller
  • allintext: Thriller Michael Jackson
We retrieve all inner-sentence fragments
between the instances and normalize them (remove
punctuation marks and capitals).
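
The retrieval and normalization step can be sketched as follows. This is a simplified illustration under stated assumptions: the sentence splitter is naive, the placeholders `artist` and `album` and the function names are hypothetical, and the snippets are assumed to be already downloaded.

```python
import re
import string

def normalize(fragment):
    """Remove punctuation marks and capitals, and collapse whitespace."""
    no_punct = fragment.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())

def candidate_patterns(snippet, artist, album):
    """Return normalized patterns from sentences that contain both instances."""
    patterns = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", snippet):
        i, j = sentence.find(artist), sentence.find(album)
        if i < 0 or j < 0:
            continue  # both instances must occur within one sentence
        if i < j:
            middle = sentence[i + len(artist):j]
            patterns.append(normalize("artist" + middle + "album"))
        else:
            middle = sentence[j + len(album):i]
            patterns.append(normalize("album" + middle + "artist"))
    return patterns

print(candidate_patterns("Thriller by Michael Jackson was released in 1982. I love it.",
                         "Michael Jackson", "Thriller"))
print(candidate_patterns("Michael Jackson's album Thriller!",
                         "Michael Jackson", "Thriller"))
```

Note that removing punctuation turns "artist's album album" into "artists album album", which is how such patterns show up in a normalized list.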
21
Retrieving relation patterns
We now have a (long) list of patterns:
  • album by artist
  • artists album album
  • album cover by artist
  • album di artist
  • artist-cd album
  • ...
Now we compute scores: frequency, precision, wide-spreadness.
22
Retrieving relation patterns
  • Frequency: we take the frequency of the pattern
    in the list obtained.
  • Precision: we google the pattern in combination
    with an instance and observe the fraction of
    useful results. E.g., if we google "ABBA's new
    album", we divide the number of excerpts with an
    album title by the total number of excerpts found.
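
The precision estimate can be illustrated with a small sketch. It assumes that an excerpt counts as useful when it mentions a title from a known list, which is a stand-in for whatever title recognition is actually used; the function name and the excerpts are invented for illustration.

```python
def pattern_precision(excerpts, known_titles):
    """Fraction of excerpts that mention at least one known album title."""
    if not excerpts:
        return 0.0
    useful = sum(
        1 for excerpt in excerpts
        if any(title.lower() in excerpt.lower() for title in known_titles)
    )
    return useful / len(excerpts)

# Hypothetical excerpts for the query "ABBA's new album".
excerpts = [
    "ABBA's new album Voulez-Vous is out now",
    "ABBA's new album is expected next year",
    "review of ABBA's new album Arrival",
    "ABBA's new album Super Trouper tops the charts",
]
titles = ["Arrival", "Voulez-Vous", "Super Trouper"]
print(pattern_precision(excerpts, titles))  # 0.75
```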
23
Evaluate relation patterns
  • Wide-spreadness: we count the number of different
    instances found with the query.
  • Score: $\text{score} = \text{freq} \cdot \text{prec} \cdot \text{spr}$
  • We only compute the scores of the N most frequent
    patterns.
  • Number of queries: $2 \cdot |\text{training set}| + N \cdot |\text{instance set}|$
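
A minimal sketch of the scoring step, assuming the score is the product of the three measures and that the query count is two queries per training pair plus one query per (pattern, instance) combination. The function names and all numbers below are hypothetical.

```python
from collections import Counter

def score_patterns(pattern_list, precision, spread, n):
    """Score only the n most frequent patterns as freq * prec * spr."""
    freq = Counter(pattern_list)
    return {
        pattern: count * precision[pattern] * spread[pattern]
        for pattern, count in freq.most_common(n)
    }

def query_count(num_training_pairs, n, num_instances):
    """Queries needed: 2 per training pair, plus n patterns times the instance set."""
    return 2 * num_training_pairs + n * num_instances

# Invented values, for illustration only.
pattern_list = ["album by artist"] * 3 + ["album di artist"] * 2 + ["artist-cd album"]
precision = {"album by artist": 0.5, "album di artist": 0.25, "artist-cd album": 1.0}
spread = {"album by artist": 4, "album di artist": 2, "artist-cd album": 1}

print(score_patterns(pattern_list, precision, spread, n=2))
print(query_count(10, 5, 20))  # 120
```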
24
Experiment hyponym patterns
Are Hearst patterns [Hea92] indeed the most
effective patterns for the is-a relation?
O = ((country, hyponym), (all countries,
{country, countries}), is_a,
((Afghanistan, country), (Afghanistan,
countries), (Akrotiri, country),
(Akrotiri, countries), ...))
25
Case-study Hearst Patterns

Both the common Hearst patterns and patterns
typical for this setting (countries) perform
well. The method can be used to select the most
effective hyponym patterns.
26
Case-study Burger King
TREC QA question: In which countries can Burger
King be found?
O = ((country, restaurant), (all countries,
{McDonald's, KFC}), located_in,
((McDonald's, USA), (KFC, China), ...))
27
Case-study Burger King
We first find patterns using the method
described
28
How to recognize a hamburger chain?
  • We know where to look for them
  • Capitalized Words
  • The query "restaurants like X and" gives enough
    hits.
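
The recognition rule can be sketched as follows: look right after the pattern and accept a run of capitalized words. This is an illustrative approximation; the function name, the regular expression, and the snippets are assumptions.

```python
import re
from collections import Counter

def candidate_chains(snippets, known_chain):
    """Count capitalized phrases that follow 'restaurants like <known_chain> and'."""
    prefix = "restaurants like " + known_chain + " and "
    regex = re.compile(
        re.escape(prefix) + r"((?:[A-Z][\w'&-]*)(?:\s[A-Z][\w'&-]*)*)"
    )
    counts = Counter()
    for snippet in snippets:
        for name in regex.findall(snippet):
            counts[name] += 1
    return counts

# Hypothetical snippets, invented for illustration.
snippets = [
    "Fast food restaurants like McDonalds and Burger King are everywhere.",
    "You will find restaurants like McDonalds and KFC in most cities.",
    "In some towns, restaurants like McDonalds and other chains are rare.",
    "In Spain, restaurants like McDonalds and Burger King opened early.",
]
print(candidate_chains(snippets, "McDonalds"))
```

Counting the candidates makes it possible to keep only names that are found repeatedly, in line with the redundancy argument.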

29
Instances learned
Learned while evaluating the patterns identified
  • Restaurants
  • Cuisines
  • Stopwords
  • Geography

30
Case-study Burger King
  • Finally, we use the patterns found in combination
    with Burger King to find relations.
  • - Precision: 80%
  • - Recall: 85%
  • Most errors due to countries in which Burger King
    plans to open restaurants.

31
Processing Data
  • We have now presented a pattern-based method to
    extract information from the web
  • How to use the method to identify new
    information?
  • If you know the instances, do alternatives exist?

32
(No Transcript)
33
[Figure labels: hard rock, folk, country, camp, classical, pop, rap, soul]
34
Extracting Community-based meta-data from the Web
  • We combine evidence from multiple web sources
  • Does the Web community agree with Last.fm
    community?

35
Overview
  • Problem description
  • Three alternative methods to find co-occurrences
  • Using instance similarities to improve mapping
  • Experimental results tagging music artist with
    genres and painters with art movement
  • Conclusions

36
Two Problems
  • $A$: a set of instances (artists, painters, ...)
  • $L$: a set of labels (tags, genres, ...)
  • 1. Find the most appropriate mapping $m(a) \in L$ for
    each instance $a \in A$.
  • 2. Find a score $t(a,b)$ expressing the
    relatedness between each pair $(a,b) \in A \times A$.

37
Using co-occurrences for the mapping
  • We use the Web to find co-occurrences of artists
    and labels.
  • If two terms co-occur relatively often in the
    same context, we can consider them to be related
  • (a) How to find co-occurrences, and which context
    to choose?
  • (b) How to use them to find a mapping m?

38
Finding artist similarities
  • Three alternatives to find web co-occurrences
    between a and l using
  • the number of hits
  • patterns
  • full documents.

39
Co-occurrences using Google hits
co(abba, disco) = 2,780,000
co(abba, hard rock) = 625,000
Use of additional terms can specify the query.
Google Complexity: $|A| \cdot |L|$ queries.
40
Co-occurrences using patterns
Take a text fragment that expresses the relation
between a label and an artist:
  • "label artists such as artist"
Take a label or an artist:
  • "country artists such as"
  • "artists such as Johnny Cash"
and Google!
41
Co-occurrences using patterns
  • The relation can be specified
  • Search for terms in the excerpts
  • - only $|A| + |L|$ queries

42
Co-occurrences using full documents
  • - Google considers these pages very relevant to
    the query (Britney Spears)
  • The tags on these pages will probably reflect
    her.
  • - again only $O(|A| + |L|)$ queries
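
To make the difference in Google Complexity concrete, the sketch below compares the query counts, assuming one query per (artist, label) pair for the hit-count method and one query per artist plus one per label for the pattern and document methods. These counts are a reading of the slides, and the function names are hypothetical. The example sizes (224 artists, 14 genres) are those of the JKU dataset used in the experiments.

```python
def queries_hits(num_artists, num_labels):
    """Hit counts: one query for every (artist, label) pair."""
    return num_artists * num_labels

def queries_patterns_or_documents(num_artists, num_labels):
    """Patterns or full documents: one query per artist plus one per label."""
    return num_artists + num_labels

print(queries_hits(224, 14))                   # 3136
print(queries_patterns_or_documents(224, 14))  # 238
```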

43
Query country music
44
Using these co-occurrences
Using relative frequencies
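
One possible reading of "relative frequencies" is sketched below: each co-occurrence count is divided by the total count of its label, so that very common labels do not dominate, and the label with the highest relative score is chosen. The normalization, the function names, and the counts for `artist_x` are assumptions; only the two abba counts come from the earlier slide.

```python
def relative_scores(co, artist):
    """Score each label by co(artist, label) divided by the label's total count."""
    totals = {}
    for (_, label), count in co.items():
        totals[label] = totals.get(label, 0) + count
    return {
        label: count / totals[label]
        for (a, label), count in co.items()
        if a == artist and totals[label] > 0
    }

def best_label(co, artist):
    """Return the label with the highest relative score for the artist."""
    scores = relative_scores(co, artist)
    return max(scores, key=scores.get)

co = {
    ("abba", "disco"): 2_780_000,
    ("abba", "hard rock"): 625_000,
    ("artist_x", "disco"): 220_000,        # hypothetical count
    ("artist_x", "hard rock"): 1_875_000,  # hypothetical count
}
print(best_label(co, "abba"))      # disco
print(best_label(co, "artist_x"))  # hard rock
```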
45
Same trick to find similar artists
  • Same three approaches to find co-occurrences
    between the artists
  • the number of hits → $|A| \cdot (|A| + |L|)$ queries
  • patterns → $O(|A| + |L|)$ queries
  • full documents → $|A| + |L|$ queries

46
Using artist similarity to improve mapping
  • We can use the computed relatedness between
    artists to improve the classification.

[Figure labels: country, country, country]
47
The final mapping
We take an artist and its k nearest neighbors. We
take a majority vote among their mappings m(a).
[Figure labels: country, country, country]
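
The voting step can be sketched as a k-nearest-neighbor majority vote. This is a minimal illustration: the function name, the tie-breaking behavior, the initial labels, and the similarity values are all assumptions made for the example.

```python
from collections import Counter

def final_mapping(artist, initial, similarity, k):
    """Majority vote over the artist and its k most similar artists."""
    others = [b for b in initial if b != artist]
    # Sort the other artists by decreasing similarity to `artist`.
    others.sort(key=lambda b: similarity[artist].get(b, 0.0), reverse=True)
    voters = [artist] + others[:k]
    votes = Counter(initial[b] for b in voters)
    # On a tie, most_common returns the label that was counted first,
    # so the artist's own label wins any tie it is part of.
    return votes.most_common(1)[0][0]

# Hypothetical initial mapping and similarity scores.
initial = {
    "Johnny Cash": "country",
    "Willie Nelson": "country",
    "Dolly Parton": "country",
    "Shania Twain": "pop",
}
similarity = {
    "Shania Twain": {"Dolly Parton": 0.9, "Johnny Cash": 0.8, "Willie Nelson": 0.7},
}
print(final_mapping("Shania Twain", initial, similarity, k=2))  # country
```

With k=1 the vote is tied and the artist keeps its initial label, which shows why the choice of k matters.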
48
Experiments
  • Which of the three methods performs best?
  • Does the use of artist similarity improve the
    mapping? Or: how to choose k?
  • We only evaluate precision (if we find nothing,
    it's wrong)
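
A small sketch of this evaluation rule, in which an instance without any label counts as an error. The function name and the example mapping are hypothetical.

```python
def mapping_precision(mapping, ground_truth):
    """Fraction of ground-truth instances that received the correct label.

    Instances missing from `mapping` count as wrong.
    """
    if not ground_truth:
        return 0.0
    correct = sum(
        1 for instance, label in ground_truth.items()
        if mapping.get(instance) == label
    )
    return correct / len(ground_truth)

ground_truth = {
    "Claude Monet": "Impressionism",
    "Pablo Picasso": "Cubism",
    "Salvador Dali": "Surrealism",
    "Piet Mondriaan": "De Stijl",
}
# Hypothetical system output: one wrong label, one painter not found.
mapping = {
    "Claude Monet": "Impressionism",
    "Pablo Picasso": "Cubism",
    "Salvador Dali": "Dada",
}
print(mapping_precision(mapping, ground_truth))  # 0.5
```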

49
Classifying artists into genres: JKU dataset
  • 224 artists divided over 14 genres
  • publicly available dataset
  • previous work on this data set focused on
    clustering the artists (e.g. Knees et al. 2004,
    Schedl et al. 2005)

50
Genres: JKU dataset
[Figure legend: Documents, Patterns, Google hits]
51
Comparing with Last.fm
52
Classifying painters into movements
  • Experiment conducted on
  • - 1,280 painters (en.wikipedia.org/List_of_painters)
  • - 77 art movements (List_of_art_movements)
  • Evaluation
  • - We visited the pages describing the art
    movements.
  • - Ground truth: painters mentioned on one of these
    pages.
  • - Leads to a set of 160 painter-movement pairs.

53
Classifying painters into movements
54
Classifying painters into movements
55
Last slide: Conclusions
  • Presented a method to gather community-based
    meta-data
  • Good methods require low Google Complexity
  • Experimental results are encouraging
  • Web Information Extraction is fun
  • Papers available via http://gijsg.dse.nl

56
(No Transcript)