1
Information Extraction
Martin Ester, Simon Fraser University, School of Computing Science
CMPT 884, Spring 2009
2
Information Extraction

Outline
Introduction: motivation, applications, issues
Entity extraction: hand-coded, machine learning
Relation extraction: supervised, partially supervised
Entity resolution: string similarity, finding similar pairs, creating groups
Future research
→ [Feldman 2006], [Agichtein & Sarawagi 2006]

3
Introduction

Motivation
80% of all human-generated data is natural language text
search engines return whole documents, requiring the user to read documents and manually extract relevant information (entities, facts, . . .) → very time-consuming
need for automatic extraction of such information from collections of natural language text documents → information extraction (IE)

4
Introduction

Definitions
Entity: an object of interest such as a person or organization.
Attribute: a property of an entity such as its name, alias, descriptor, or type.
Relation: a relationship held between two or more entities, such as Position of a Person in a Company.
Event: an activity involving several entities, such as a terrorist act, aircraft crash, management change, new product introduction.

5
Introduction

Example

6
Introduction

Applications
question answering: "Who is the president of the US?", "Where was Martin Luther born?"
automatic creation of databases: e.g., database of protein localizations or adverse reactions to a drug
opinion mining: analyzing online product reviews to get user feedback

7
Introduction

Challenges
Complexity of natural language: e.g., identifying word and sentence boundaries is fairly easy in European languages, much harder in Chinese / Japanese
Ambiguity of natural language: e.g., homonyms
Diversity of natural language: many ways of expressing a given piece of information, e.g. synonyms
Diversity of writing styles: e.g., scientific papers, newspaper articles, maintenance reports, emails, . . .

8
Introduction

Challenges
names are hard to discover
- impossible to enumerate
- new candidates are generated all the time
- hard to provide syntactic rules
types of proper names
- people
- companies
- products
- genes
- . . .

9
Introduction

Architecture of IE System

Local analysis
Discourse (global) analysis
10
Introduction

Knowledge Engineering Approach
Extraction rules are hand-crafted by linguists in
cooperation with domain experts.
Most of the work is done by inspecting a set of
relevant documents.
Development of rule set is very time-consuming.
Requires substantial CS and domain expertise.
Rule sets are domain-specific, do not transfer
to other domains.
Knowledge engineering (KE) approach often
achieves higher accuracy than machine
learning approach.

11
Introduction

Machine Learning Approach
Automatically learn model (rules) from
annotated training corpus.
Techniques based on pure statistics and little
linguistic knowledge.
No CS expertise required when building model.
However, creating the annotated corpus is very laborious, since a very large number of training examples is needed.
Transfer to other domains is easier than KE
approach.
Accuracy of machine learning (ML) approach is
typically lower.

12
Introduction

Topics Not Covered
co-reference resolution: e.g., article referencing a noun (entity) of another sentence
event extraction: event has type, actor, time, . . .
sentiment detection: a certain statement (opinion) is classified as positive / negative

13
Entity Extraction

Lexical Analysis
breaking up the input document into individual words (tokens)
token: sequence of characters treated as a unit
punctuation marks are also considered tokens, e.g., "," (comma)
often, regular expressions are used to define the format of a token
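
The regular-expression idea above can be sketched as follows. This is only an illustration: the token format (a run of word characters, or one punctuation character) and the names TOKEN_RE and tokenize are assumptions, not taken from the slides.

```python
import re

# Assumed token format: a run of word characters, or a single
# non-space punctuation character (so "," becomes its own token).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_RE.findall(text)


print(tokenize("Dr. Laura Haas, IBM"))
```

Real tokenizers usually need further cases (abbreviations, numbers with decimal points, hyphenated words), which this sketch deliberately ignores.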

14
Entity Extraction

Syntactic Analysis
part-of-speech tagging [Charniak 1997]: marking up the tokens in a text as corresponding to a particular part of speech (POS), based on both their definition and their context
coarse POS tags, e.g., N, V, A, Aux, . . .
finer POS tags
- PRP: personal pronouns (you, me, she, he, them, him, her, . . .)
- PRP$: possessive pronouns (my, our, her, his, . . .)
- NN: singular common nouns (sky, door, theorem, . . .)
- NNS: plural common nouns (doors, theorems, women, . . .)
- NNP: singular proper names (Fifi, IBM, Canada, . . .)
- NNPS: plural proper names (Americas, Carolinas, . . .)

15
Entity Extraction

Syntactic Analysis
Words often have more than one POS, e.g. back
- The back door: JJ
- On my back: NN
- Win the voters back: RB
- Promised to back the bill: VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
e.g., input: the lead paint is unsafe
output: the/Det lead/N paint/N is/V unsafe/Adj

16
Entity Extraction

Knowledge Engineering Approach [Chaudhuri et al. 2005]
hand-coded rules are often relatively straightforward
easy to incorporate domain knowledge
require substantial CS expertise
example rule:
<token>INITIAL</token> <token>DOT</token> <token>CAPSWORD</token> <token>CAPSWORD</token>
→ finds person names with a salutation and two capitalized words, e.g. Dr. Laura Haas
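
A rough Python equivalent of the example rule might look like this. The salutation list standing in for the INITIAL token class and the definition of CAPSWORD are assumptions made for illustration; the slides do not specify them.

```python
import re

# Assumed stand-in for the INITIAL token class (a salutation).
SALUTATION = r"(?:Dr|Mr|Mrs|Ms|Prof)"
# Assumed definition of CAPSWORD: one capital letter, then lowercase letters.
CAPSWORD = r"[A-Z][a-z]+"

# INITIAL DOT CAPSWORD CAPSWORD
PERSON_RE = re.compile(rf"\b{SALUTATION}\.\s+({CAPSWORD})\s+({CAPSWORD})\b")


def find_person_names(text):
    """Return the two capitalized words following each salutation."""
    return [" ".join(m.groups()) for m in PERSON_RE.finditer(text)]


print(find_person_names("We thank Dr. Laura Haas for the talk."))
```

The sketch also shows why such rules are brittle: names with middle initials, hyphens, or more than two parts are missed unless further rules are added.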

17
Entity Extraction

Knowledge Engineering Approach
a more complex example: conference name
my $wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
my $confTypes="(?:Conference|Workshop|Symposium)";
my $words="(?:[A-Z]\\w+\\s*)"; # A word starting with a capital letter and ending with 0 or more spaces
my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)"; # e.g. "International Conference ..." or the conference name for workshops (e.g. "VLDB Workshop ...")
my $connectors="(?:on|of)";
my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))"; # abbreviations like "(SIGMOD'06)"
my $fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|<)";
. . .

18
Entity Extraction

Machine Learning Approach
We can view named entity extraction as a sequence classification problem: classify each word as belonging to one of the named entity classes or to the no-name class.
The class label of a sequence element depends on neighboring ones.
One of the most popular techniques for classifying sequences is Hidden Markov Models (HMM).
Another popular ML method for entity extraction: Conditional Random Fields [Lafferty et al. 2001].
Requires a large enough labeled (annotated) training dataset.

19
Entity Extraction

Hidden Markov Models [Rabiner 1989]
HMM (Hidden Markov Model) is a finite state
automaton with stochastic state transitions
and symbol emissions.
The automaton models a probabilistic generative
process.
In this process a sequence of symbols is
produced by starting in an initial state,
transitioning to a new state, emitting a
symbol selected by the state and repeating this
transition/emission cycle until a designated
final state is reached.
Very successful in many sequence classification
tasks.

20
Entity Extraction

Example
HMM for addresses

21
Entity Extraction

Hidden Markov Models
$T$: length of the sequence of observations (training set)
$N$: number of states in the model
$q_t$: the actual state at time $t$
$S = \{S_1, \ldots, S_N\}$: finite set of possible states
$V = \{O_1, \ldots, O_M\}$: finite set of observation symbols
$\pi = \{\pi_i = P(q_1 = S_i)\}$: starting probabilities
$A = \{a_{ij} = P(q_{t+1} = S_i \mid q_t = S_j)\}$: transition probabilities
$B = \{b_i(O_t) = P(O_t \mid q_t = S_i)\}$: emission probabilities
$\lambda = (\pi, A, B)$: hidden Markov model

22
Entity Extraction

Hidden Markov Models
How to find $P(O \mid \lambda)$, the probability of an observation sequence given the HMM? → forward-backward algorithm
How to find the $\lambda$ that maximizes $P(O \mid \lambda)$? This is the task of the training phase. → Baum-Welch algorithm
How to find the most likely state trajectory given $\lambda$ and $O$? This is the task of the test phase. → Viterbi algorithm
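
To make the first and third questions concrete, here is a minimal sketch of the forward recursion and the Viterbi algorithm on a toy two-state HMM. The states, observation symbols, and all probabilities are invented for illustration. In this code trans_p[r][s] means the probability of moving from state r to state s.

```python
# Toy HMM (all numbers are made up): is a token part of a NAME or not,
# given only whether the token is capitalized ("cap") or lowercase ("low")?
STATES = ["NAME", "OTHER"]
START_P = {"NAME": 0.2, "OTHER": 0.8}
TRANS_P = {
    "NAME": {"NAME": 0.5, "OTHER": 0.5},
    "OTHER": {"NAME": 0.2, "OTHER": 0.8},
}
EMIT_P = {
    "NAME": {"cap": 0.9, "low": 0.1},
    "OTHER": {"cap": 0.1, "low": 0.9},
}


def forward(obs, states, start_p, trans_p, emit_p):
    """Probability of the observation sequence, summed over all state paths."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit_p[s][o] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the single most likely state sequence."""
    # best[s] = (probability, path) of the best path that ends in state s
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[r][0] * trans_p[r][s], best[r][1]) for r in states),
                key=lambda x: x[0],
            )
            new_best[s] = (prob * emit_p[s][o], path + [s])
        best = new_best
    return max(best.values(), key=lambda x: x[0])


prob, path = viterbi(["low", "cap", "cap", "low"], STATES, START_P, TRANS_P, EMIT_P)
print(path, prob)
```

For long sequences a practical implementation would work with log probabilities to avoid numerical underflow; that is omitted here for brevity.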

23
Relation Extraction

Example

Microsoft's central headquarters in Redmond is home to almost every product group and division.
Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for "thinking a little too different."
Apple's programmers "think different" on a "campus" in Cupertino, Cal. Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore.

Organization     Location
Microsoft        Redmond
Apple Computer   Cupertino
Nike             Portland
24
Relation Extraction

Introduction
No single source contains all the relations
Each relation appears on many web pages
There are repeated patterns in the way relations are represented on web pages → exploit redundancy
Components of a relation appear close together → use context of occurrence of relation to determine patterns
pattern: consists of constants (tokens) and variables (placeholders for entities)
tuple: instance / occurrence of a relation

25
Relation Extraction

Introduction
Typically requires entity extraction (tagging)
as preprocessing
Knowledge engineering approach
- patterns defined over lexical items: <company> located in <location>
- patterns defined over parsed text: ((Obj <company>) (Verb located) () (Subj <location>))
Machine learning approach
- learn rules/patterns from examples
- partially supervised: bootstrap from example tuples [Agichtein & Gravano 2000], [Etzioni et al. 2004]

26
Relation Extraction

Snowball [Agichtein & Gravano 2000]
Exploit duality between patterns and tuples
- find tuples that match a set of patterns
- find patterns that match a lot of tuples
→ bootstrapping approach

[Figure: Snowball bootstrapping cycle with the boxes Initial Seed Tuples, Occurrences of Seed Tuples, Tag Entities, Generate Extraction Patterns, Generate New Seed Tuples, Augment Table]
27
Relation Extraction

Snowball
how to represent patterns of occurrences?

occurrences of the initial seed tuples:
Computer servers at Microsoft's headquarters in Redmond . . .
In mid-afternoon trading, share of Redmond-based Microsoft fell . . .
The Armonk-based IBM introduced a new line . . .
The combined company will operate from Boeing's headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium processor.
28
Relation Extraction

Patterns
(extraction) pattern has format <left, tag1, middle, tag2, right>, where tag1, tag2 are named-entity tags and left, middle, and right are vectors of weighted terms
patterns derived directly from occurrences are too specific

Example occurrence: ORGANIZATION 's central headquarters in LOCATION is home to...
< left, tag1, middle, tag2, right >
tag1: ORGANIZATION
middle: <'s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>
tag2: LOCATION
right: <is 0.75>, <home 0.75>
29
Relation Extraction

Pattern Clusters
cluster patterns, cluster centroids define
patterns

30
Relation Extraction

Evaluation of Patterns
How good are new extraction patterns?
Measure their performance through their accuracy
vs. the initial seed tuples (ground truth).

extraction with pattern: ORGANIZATION, LOCATION
Boeing, Seattle, said . . . : Positive
Intel, Santa Clara, cut prices . . . : Positive
invest in Microsoft, New York-based analyst Jane Smith said . . . : Negative
31
Relation Extraction

Evaluation of Patterns
Trust only patterns with high support and
confidence, i.e. that produce many correct
(positive) tuples and only a few false
(negative) tuples.
$\mathrm{conf}(p) = \dfrac{pos(p)}{pos(p) + neg(p)}$, where $p$ denotes a pattern and $pos(p)$, $neg(p)$ denote the numbers of positive and negative tuples produced

32
Relation Extraction

Evaluation of Tuples
Trust only tuples that match many patterns.
Suppose candidate tuple t matches patterns p1
and p2. What is the probability that t is a
valid tuple?
Assume matches of different patterns are
independent events.
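$\Pr[t \text{ matches } p_1 \text{ and } t \text{ is not valid}] = 1 - \mathrm{conf}(p_1)$
$\Pr[t \text{ matches } p_2 \text{ and } t \text{ is not valid}] = 1 - \mathrm{conf}(p_2)$
$\Pr[t \text{ matches } p_1, p_2 \text{ and } t \text{ is not valid}] = (1 - \mathrm{conf}(p_1))(1 - \mathrm{conf}(p_2))$
$\Pr[t \text{ matches } p_1, p_2 \text{ and } t \text{ is valid}] = 1 - (1 - \mathrm{conf}(p_1))(1 - \mathrm{conf}(p_2))$
If tuple $t$ matches a set of patterns $P$:
$\mathrm{conf}(t) = 1 - \prod_{p \in P} (1 - \mathrm{conf}(p))$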
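
The two confidence formulas can be written down directly. The function names are illustrative, and the zero-denominator convention in pattern_confidence is an assumption not discussed on the slides.

```python
from math import prod


def pattern_confidence(pos, neg):
    """conf(p) = pos / (pos + neg); defined here as 0.0 if p produced nothing."""
    total = pos + neg
    return pos / total if total else 0.0


def tuple_confidence(pattern_confs):
    """conf(t) = 1 - product over matching patterns of (1 - conf(p))."""
    return 1.0 - prod(1.0 - c for c in pattern_confs)


# A pattern with 2 positive and 1 negative extraction:
print(pattern_confidence(2, 1))
# A tuple matched by two patterns with confidences 0.5 and 0.8:
print(tuple_confidence([0.5, 0.8]))
```

Because of the independence assumption, every additional matching pattern can only raise the confidence of a tuple, never lower it.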

33
Relation Extraction

Snowball Algorithm
1. Start with seed set R of tuples
2. Generate set P of patterns from R
   - compute support and confidence for each pattern in P
   - discard patterns with low support or confidence
3. Generate new set T of tuples matching patterns P
   - compute confidence of each tuple in T
   - add to R the tuples t in T with conf(t) > threshold
4. Go back to step 2
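
The loop can be sketched on a toy corpus as below. This is a heavy simplification and not the Snowball implementation: a pattern is reduced to the literal text between the two entities, the occurrences are assumed to be entity-tagged already, and all data, thresholds, and names are invented.

```python
from math import prod

# Hypothetical pre-tagged occurrences: (organization, middle context, location).
OCCURRENCES = [
    ("Microsoft", "'s headquarters in", "Redmond"),
    ("Boeing", "'s headquarters in", "Seattle"),
    ("Intel", ",", "Santa Clara"),
    ("Boeing", ",", "Seattle"),
    ("Microsoft", ",", "New York"),
    ("Apple", "'s headquarters in", "Cupertino"),
]


def snowball(seeds, occurrences, min_support=1, min_conf=0.6,
             threshold=0.6, rounds=3):
    """Toy bootstrapping loop; a pattern is just the literal middle context."""
    table = dict(seeds)  # step 1: organization -> location (1:1 relation)
    for _ in range(rounds):
        # Step 2: score each pattern against tuples already in the table.
        pos, neg = {}, {}
        for org, middle, loc in occurrences:
            if org in table:
                counter = pos if table[org] == loc else neg
                counter[middle] = counter.get(middle, 0) + 1
        conf = {}
        for p, support in pos.items():
            c = support / (support + neg.get(p, 0))
            if support >= min_support and c >= min_conf:
                conf[p] = c
        # Step 3: collect candidate tuples and the patterns that matched them.
        matched = {}
        for org, middle, loc in occurrences:
            if middle in conf and org not in table:
                matched.setdefault((org, loc), set()).add(middle)
        added = False
        for (org, loc), patterns in matched.items():
            t_conf = 1.0 - prod(1.0 - conf[p] for p in patterns)
            if t_conf > threshold and org not in table:
                table[org] = loc
                added = True
        if not added:
            break  # step 4: another round would find nothing new
    return table


print(snowball({"Microsoft": "Redmond", "Boeing": "Seattle"}, OCCURRENCES))
```

In this toy run the "," pattern is discarded because it also produced a tuple that contradicts a seed, so the Intel occurrence is not extracted, while the headquarters pattern adds Apple.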

34
Relation Extraction

Discussion
bootstrapping approach requires only a relatively small number of training tuples (semi-supervised)
is effective for binary, 1:1 relations
bootstrapping approach has been adopted by a lot of subsequent work
pattern evaluation is heuristic and has no theory behind it
→ Statistical Snowball, WWW '09
what about n-ary relations?
what about 1:m relations?

35
Entity Resolution

Introduction

36
Entity Resolution

Introduction
Entity resolution
- map entity mentions to the corresponding entities
- entities stored in database or ontology
Challenges
- large lists with multiple noisy mentions of the same entity
- no single attribute to order or cluster likely duplicates while separating them from similar but different entities
- need to depend on fuzzy and computationally expensive string similarity functions

37
Entity Resolution

Introduction
Typical approach
- define string similarity: numeric attributes are easy to compare, string attributes are hard and require approximate matching
- find similar pairs of entities
- create groups from duplicate entity pairs (clustering)

38
Entity Resolution

String Similarity
Token-based
Jaccard, TF-IDF cosine similarities
→ suitable for large documents
Character-based
Edit distance and variants like Levenshtein, Jaro-Winkler, Soundex
→ suitable for short strings with spelling mistakes
Hybrids

39
Entity Resolution

Token-Based String Similarity
Tokens/words
AT&T Corporation → {AT&T, Corporation}
Similarity: various measures of overlap of two sets S, T
$\mathrm{Jaccard}(S,T) = |S \cap T| \,/\, |S \cup T|$
Example
S = AT&T Corporation → {AT&T, Corporation}
T = AT&T Corp → {AT&T, Corp}
Jaccard(S,T) = 1/3
Variants: weights attached to each token
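
The Jaccard example can be checked with a few lines of Python. Whitespace tokenization and the convention that two empty sets have similarity 1.0 are assumptions of this sketch.

```python
def jaccard(s, t):
    """Jaccard similarity of two token collections: |S ∩ T| / |S ∪ T|."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # assumed convention for two empty sets
    return len(s & t) / len(s | t)


S = "AT&T Corporation".split()
T = "AT&T Corp".split()
print(jaccard(S, T))  # one shared token out of three distinct tokens
```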

40
Entity Resolution

Token-Based String Similarity
Sets transformed to vectors with each term as a dimension
Cosine similarity: dot product of two vectors, each normalized to unit length
→ cosine of the angle between them
Term weight: TF/IDF = $\log(tf + 1) \cdot \log(idf)$, where
tf = frequency of the term in a document d
idf = number of documents / number of documents containing the term
→ rare terms are more important
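
A small sketch of the weighting and the cosine measure, using the log(tf + 1) · log(idf) weight from this slide. The four-document corpus is invented, and a term that occurs in every document would receive weight 0 under this formula.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: math.log(tf[term] + 1) * math.log(n / df[term])
            for term in tf
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical corpus of entity names, already tokenized.
DOCS = [
    ["AT&T", "Corporation"],
    ["AT&T", "Corp"],
    ["IBM", "Corporation"],
    ["Nike", "Corporation"],
]
VECS = tfidf_vectors(DOCS)
print(cosine(VECS[0], VECS[1]), cosine(VECS[0], VECS[2]))
```

In this corpus "Corporation" is frequent and "AT&T" is rarer, so sharing "AT&T" contributes more to the similarity than sharing "Corporation".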

41
Entity Resolution

Token-Based String Similarity
Widely used in traditional IR
Example
AT&T Corporation, AT&T Corp or AT&T Inc
low weights for Corporation, Corp, Inc; higher weight for AT&T

42
Entity Resolution

Character-Based String Similarity
Given two strings S, T: edit(S,T) = minimum cost of a sequence of operations that transforms S into T.
Character operations: I (insert), D (delete), R (replace).
Example: edit(Error, Eror) = 1, edit(great, grate) = 2
Dynamic programming algorithm to compute edit()
Several variants (gaps, weights) → can become NP-complete
Varying costs of operations can be learnt
Suitable for common typing mistakes on small strings
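
The dynamic programming algorithm can be sketched as follows for the basic case where every operation has cost 1 (Levenshtein distance). The function name is illustrative.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit cost for insert, delete, and replace."""
    # prev[j] = distance between the processed prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # replace cs by ct (free if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Error", "Eror"), edit_distance("great", "grate"))
```

Keeping only two rows of the table brings memory down to O(|T|); the running time stays O(|S| · |T|).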

43
Entity Resolution

Find Duplicate Pairs
Input: a large list of entities with string attributes
Output: all pairs (S,T) of entities which satisfy a similarity criterion such as
- Jaccard(S,T) > 0.7
- Edit-distance(S,T) < k
Naive method: for each record pair, compute similarity score
I/O and CPU intensive, not scalable to millions of entities
Goal: reduce $O(n^2)$ cost to $O(n \cdot w)$, where $w \ll n$
Reduce number of pairs on which similarity is computed

44
Entity Resolution

Find Duplicate Pairs
Method: filter and refinement
Use an inexpensive filter to filter out as many pairs as possible, e.g.
$\mathrm{EditDistance}(s,t) \le d \;\Rightarrow\; |\text{q-grams}(s) \cap \text{q-grams}(t)| \ge \max(|s|,|t|) - (d-1)q - 1$
q-gram: substring of q consecutive characters, e.g. the 3-grams of "AT&T Corporation" are "AT&", "T&T", "&T ", "T C", " Co", "Cor", "orp", "rpo", "por", "ora", "rat", "ati", "tio", "ion"
If a pair (s, t) does not satisfy the filter, it cannot satisfy the similarity criterion, e.g.
$|\text{q-grams}(s) \cap \text{q-grams}(t)| < \max(|s|,|t|) - (d-1)q - 1 \;\Rightarrow\; \mathrm{EditDistance}(s,t) > d$
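
A sketch of this count filter is given below. It assumes the variant in which strings are padded with q-1 marker characters on each side and q-grams are counted as a multiset, which is the setting in which the bound above is usually stated. The marker characters "#" and "$" are assumed not to occur in the data.

```python
from collections import Counter


def qgrams(s, q):
    """Multiset of q-grams of s, padded with q-1 marker characters per side."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def passes_filter(s, t, d, q):
    """False means EditDistance(s, t) > d is guaranteed; True means 'maybe'."""
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())  # multiset overlap
    return shared >= max(len(s), len(t)) - (d - 1) * q - 1


print(passes_filter("Error", "Eror", d=1, q=2))   # one deletion apart
print(passes_filter("Error", "Great", d=1, q=2))  # no q-grams in common
```

The filter has no false negatives but may let through pairs whose true edit distance exceeds d, which is why the refinement step is still needed.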

45
Entity Resolution

Find Duplicate Pairs
Do not have to apply the filter to all pairs of entities: use an index to retrieve the subset of entities that share q-grams
Compute the expensive similarity function only on pairs that survive the filter step, e.g. EditDistance(s,t)
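
One way to realize such an index is an inverted list from each q-gram to the entities containing it, as in the sketch below. The entity list, the unpadded q-gram sets, and the min_shared parameter are illustrative assumptions.

```python
from collections import defaultdict


def qgram_set(s, q=3):
    """Set of distinct (unpadded) q-grams of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def candidate_pairs(entities, q=3, min_shared=1):
    """Index pairs (i, j), i < j, whose strings share >= min_shared q-grams."""
    index = defaultdict(list)  # q-gram -> ids of earlier entities containing it
    shared = defaultdict(int)  # (i, j) -> number of distinct shared q-grams
    for j, name in enumerate(entities):
        for gram in qgram_set(name, q):
            for i in index[gram]:
                shared[(i, j)] += 1
            index[gram].append(j)
    return {pair for pair, count in shared.items() if count >= min_shared}


ENTITIES = ["AT&T Corporation", "AT&T Corp", "Nike Inc", "IBM Corporation"]
print(sorted(candidate_pairs(ENTITIES, q=3, min_shared=4)))
```

Only pairs that co-occur in at least one inverted list are ever touched, so entities with nothing in common (here "Nike Inc" and the others) never reach the expensive comparison.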

46
Entity Resolution

Create Groups of Duplicates
Given pairs of duplicate entities
Group them such that each group corresponds to
one entity
Many clustering algorithms have been applied
Number of clusters hard to specify in advance
Ground truth may be available for some entity pairs → semi-supervised clustering

47
Entity Resolution

Create Groups of Duplicates
Agglomerative clustering: repeatedly merge the closest clusters
Definition of closeness of clusters is subject to tuning: Average/Max/Min similarity
Efficient implementations possible using special data structures
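
A naive version of agglomerative clustering, without the special data structures mentioned above, might look like this. The pairwise similarities are invented, and stopping at a similarity threshold (rather than at a fixed number of clusters) is an assumption of the sketch.

```python
def agglomerative(items, sim, threshold, linkage=max):
    """Merge the two closest clusters until no pair reaches the threshold.

    linkage combines the pairwise similarities between two clusters:
    max (Max similarity), min (Min similarity), or an averaging function.
    """
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best, best_pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = linkage(
                    [sim(a, b) for a in clusters[i] for b in clusters[j]]
                )
                if best is None or score > best:
                    best, best_pair = score, (i, j)
        if best < threshold:
            break
        i, j = best_pair
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


# Hypothetical pairwise similarities between four entity mentions.
NAMES = ["AT&T Corporation", "AT&T Corp", "AT&T Inc", "Nike Inc"]
PAIR_SIM = {
    frozenset((NAMES[0], NAMES[1])): 0.9,
    frozenset((NAMES[0], NAMES[2])): 0.7,
    frozenset((NAMES[1], NAMES[2])): 0.8,
    frozenset((NAMES[0], NAMES[3])): 0.1,
    frozenset((NAMES[1], NAMES[3])): 0.1,
    frozenset((NAMES[2], NAMES[3])): 0.4,
}


def lookup_sim(a, b):
    return PAIR_SIM[frozenset((a, b))]


print(agglomerative(NAMES, lookup_sim, threshold=0.6, linkage=max))
print(agglomerative(NAMES, lookup_sim, threshold=0.75, linkage=min))
```

The two calls illustrate the tuning issue: with Max similarity all three AT&T mentions end up in one group, while Min similarity with a stricter threshold keeps "AT&T Inc" separate.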

48
Entity Resolution

Challenges
Collective entity resolution: consider relationships between entities and propagate resolution decisions along these relationships → use Markov Logic Networks [Parag & Domingos 2005]
Mapping to existing background knowledge: an ontology of real-world entities may be given; map entities / clusters of entities to ontology entries → k-nearest neighbor methods

49
Information Extraction

References
Eugene Agichtein, Luis Gravano: Snowball: Extracting Relations from Large Plain-Text Collections, ACM DL, 2000
Eugene Agichtein, Sunita Sarawagi: Scalable Information Extraction and Integration, Tutorial KDD 2006
Eugene Charniak: Statistical Techniques for Natural Language Parsing, AI Magazine 18(4), 1997
S. Chaudhuri, R. Ramakrishnan, G. Weikum: Integrating DB and IR Technologies: What is the Sound of One Hand Clapping?, CIDR 2005
Ronen Feldman: Information Extraction: Theory and Practice, Tutorial ICML 2006

50
Information Extraction

References
John Lafferty, Andrew McCallum, Fernando Pereira: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, ICML 2001
L. R. Rabiner: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proc. IEEE 77(2), 1989
