Lecture 16: Information Extraction - PowerPoint PPT Presentation

1
Lecture 16: Information Extraction
Oct. 26, 2007, ChengXiang Zhai
Most slides are from Eugene Agichtein's and
William Cohen's tutorials
2
The Value of Text Data
  • Unstructured text data is the primary form of
    human-generated information
  • Blogs, web pages, news, scientific literature,
    online reviews,
  • Semi-structured data (database generated); see
    Prof. Bing Liu's KDD webinar:
    http://www.cs.uic.edu/liub/WCM-Refs.html
  • The techniques discussed here are complementary
    to structured object extraction methods
  • Need to extract structured information to
    effectively manage, search, and mine the data
  • Information Extraction: a mature but active
    research area
  • Intersection of Computational Linguistics,
    Machine Learning, Data mining, Databases, and
    Information Retrieval
  • Traditional focus on accuracy of extraction
  • Recently attention paid to scalability

3
Example Answering Queries Over Text
For years, Microsoft Corporation CEO Bill Gates
was against open source. But today he appears to
have changed his mind. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered, saying ...
SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'

PEOPLE
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..

Result: Bill Gates, Bill Veghte
(from William Cohen's IE tutorial, 2003)
4
IE History Pre-Web
  • Mostly news articles
  • De Jong's FRUMP [1982]
  • Hand-built system to fill Schank-style scripts
    from news wire
  • Message Understanding Conference (MUC), DARPA
    '87-'95; TIPSTER '92-'96
  • Early work dominated by hand-built models
  • E.g. SRI's FASTUS, hand-built FSMs
  • But by the 1990s, some machine learning [Lehnert,
    Cardie, Grishman] and then HMMs [Elkan, Leek 97;
    BBN, Bikel et al 98]

5
IE History Web
  • AAAI '94 Spring Symposium on Software Agents
  • Much discussion of ML applied to the Web: Maes,
    Mitchell, Etzioni
  • Tom Mitchell's WebKB, '96
  • Build KBs from the Web
  • Wrapper Induction
  • Initially hand-built, then ML [Soderland 96,
    Kushmerick 97, ...]
  • Citeseer, Cora, FlipDog, contEd courses,
    corpInfo, ...
  • WebFountain (IBM)
  • KnowItAll (University of Washington)

6
IE History Other Domains
  • Biology
  • Gene/protein entity extraction
  • Protein/protein interaction facts
  • Automated curation/integration of databases
  • At CMU: SLIF (Murphy et al; subcellular
    information from images and text in journal
    articles)
  • At UIUC: BeeSpace (http://www.beespace.uiuc.edu/)
  • Email
  • EPCA, PAL, RADAR, CALO: intelligent office
    assistants that understand some part of email
  • At CMU: web site update requests, office-space
    requests, calendar scheduling requests, social
    network analysis of email

7
Landscape of IE Tasks (1/4): Degree of Formatting
Text paragraphs without formatting
Grammatical sentences and some formatting & links
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
Non-grammatical snippets, rich formatting & links
Tables
8
Landscape of IE Tasks (2/4): Intended Breadth of
Coverage
Web site specific
Genre specific
Wide, non-specific
Formatting
Layout
Language
Amazon.com Book Pages
Resumes
University Names
9
Landscape of IE Tasks (3/4): Complexity
E.g. word patterns
Regular set
Closed set
U.S. phone numbers
U.S. states
Phone (413) 545-1323
He was born in Alabama
The CALD main office can be reached at
412-268-1299
The big Wyoming sky
Ambiguous patterns, needing context and many
sources of evidence
Complex pattern
U.S. postal addresses
Person names
University of Arkansas P.O. Box 140 Hope, AR
71802
was among the six houses sold by Hope Feldman
that year.
Pawel Opalinski, Software Engineer at WhizBang
Labs.
Headquarters: 1128 Main Street, 4th
Floor, Cincinnati, Ohio 45210
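The complexity tiers above map directly onto implementation effort: a closed set needs only a dictionary lookup, a regular set a regular expression. A minimal sketch of both (the state list is truncated and the phone regex simplified, for illustration only):

```python
import re

# Closed set: U.S. states can be matched by dictionary lookup (truncated list).
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}

# Regular set: U.S. phone numbers fit a (simplified) regular expression.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[ -]?\d{3}-\d{4}\b")

def find_phones(text):
    """Return all phone-number-like spans in the text."""
    return PHONE_RE.findall(text)

def find_states(text):
    """Return capitalized tokens that appear in the closed set."""
    return [w for w in re.findall(r"[A-Z][a-z]+", text) if w in US_STATES]
```

The ambiguous-pattern tier (person names, postal addresses) is exactly where such single-source matchers break down and context features become necessary.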
10
Landscape of IE Tasks (4/4): Single Field/Record
Jack Welch will retire as CEO of General Electric
tomorrow. The top role at the Connecticut
company will be filled by Jeffrey Immelt.
Single entity:
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut
Binary relationship:
  Relation: Person-Title
    Person: Jack Welch, Title: CEO
  Relation: Company-Location
    Company: General Electric, Location: Connecticut
N-ary record:
  Relation: Succession
    Company: General Electric, Title: CEO,
    Out: Jack Welch, In: Jeffrey Immelt
Named entity extraction
11
Landscape of IE Techniques (1/1): Models
Lexicons
Abraham Lincoln was born in Kentucky.
member?
Alabama Alaska Wisconsin Wyoming
Any of these models can be used to capture words,
formatting or both.
12
Hand-Coded Methods
  • Easy to construct in some cases
  • e.g., to recognize prices, phone numbers, zip
    codes, conference names, etc.
  • Intuitive to debug and maintain
  • Especially if written in a high-level language
  • Can incorporate domain knowledge
  • Scalability issues
  • Labor-intensive to create
  • Highly domain-specific
  • Often corpus-specific
  • Rule-matches can be expensive

IBM Avatar
13
Machine Learning Methods
  • Can work well when lots of training data are
    easy to construct
  • Can capture complex patterns that are hard to
    encode with hand-crafted rules
  • e.g., determine whether a review is positive or
    negative
  • extract long complex gene names
  • Non-local dependencies

14
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer
Science Carnegie Mellon University
330 pm 7500 Wean
Hall Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during the
1980s and 1990s. As a result of its success and
growth, machine learning is evolving into a
collection of related disciplines inductive
concept acquisition, analytic learning in problem
solving (e.g. analogy, explanation-based
learning), learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
15-17
Extraction by Sliding Window
(Slides 15-17 repeat the announcement text above,
with the sliding window advancing across it.)
18
A Naïve Bayes Sliding Window Model
Freitag 1997
00 pm  Place: Wean Hall Rm 5409
Speaker: Sebastian Thrun

prefix:   w_{t-m} ... w_{t-1}
contents: w_t ... w_{t+n}
suffix:   w_{t+n+1} ... w_{t+n+m}

Estimate Pr(LOCATION | window) using Bayes
rule. Try all reasonable windows (vary length,
position). Assume independence for length, prefix
words, suffix words, and content words. Estimate
from data quantities like Pr("Place" in
prefix | LOCATION).
If Pr("Wean Hall Rm 5409" = LOCATION) is above
some threshold, extract it.
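The scoring step above can be sketched as follows. All probabilities are made-up toy numbers standing in for quantities estimated from data, and the flat background model is a simplification added so the comparison has something to beat:

```python
import math

# Toy estimates (assumed, for illustration; not Freitag's trained values).
prior = {"LOCATION": 0.05, "OTHER": 0.95}
prefix_probs = {"Place": 0.5, "Speaker": 0.01}   # P(word in prefix | LOCATION)
content_probs = {"Wean": 0.3, "Hall": 0.3, "Rm": 0.2, "5409": 0.1}
DEFAULT = 1e-4   # smoothing for words unseen under LOCATION
BG = 0.01        # flat per-word probability under the OTHER model (assumed)

def log_score(label, prefix, contents):
    """Naive-Bayes log-probability of the window under a label, assuming
    prefix words and content words are independent given the label."""
    if label == "LOCATION":
        s = math.log(prior[label])
        s += sum(math.log(prefix_probs.get(w, DEFAULT)) for w in prefix)
        s += sum(math.log(content_probs.get(w, DEFAULT)) for w in contents)
        return s
    return math.log(prior[label]) + (len(prefix) + len(contents)) * math.log(BG)

def extract(prefix, contents):
    """Extract the window as a LOCATION if it outscores the background."""
    return log_score("LOCATION", prefix, contents) > log_score("OTHER", prefix, contents)
```

With these numbers, the window following "Place" scores above background while an unrelated window does not; a real system would also fold in the length and suffix terms.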
19
BWI Learning to detect boundaries
Freitag & Kushmerick, AAAI 2000
  • Another formulation learn three probabilistic
    classifiers
  • START(i) = Prob(position i starts a field)
  • END(j) = Prob(position j ends a field)
  • LEN(k) = Prob(an extracted field has length k)
  • Then score a possible extraction (i,j) by
  • START(i) × END(j) × LEN(j-i)
  • LEN(k) is estimated from a histogram
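A toy rendering of the scoring formula, with hypothetical probability tables standing in for the learned START/END classifiers and the length histogram:

```python
# Hypothetical classifier outputs for a speaker-name field in this token
# sequence (all numbers assumed, for illustration).
tokens = ["Speaker", ":", "Sebastian", "Thrun", "Place", ":", "Wean"]
START = {2: 0.9, 4: 0.1}        # START(i): position i starts the field
END = {4: 0.8, 5: 0.2}          # END(j): field ends just before position j
LEN = {1: 0.2, 2: 0.6, 3: 0.2}  # histogram of observed field lengths

def score(i, j):
    """BWI score of extracting tokens[i:j]: START(i) * END(j) * LEN(j - i)."""
    return START.get(i, 0.0) * END.get(j, 0.0) * LEN.get(j - i, 0.0)

# Pick the highest-scoring candidate span.
best = max(((i, j) for i in START for j in END if j > i),
           key=lambda ij: score(*ij))
```

Here the best span is tokens[2:4], i.e. "Sebastian Thrun".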

20
IE with Hidden Markov Models
Given a sequence of observations
Yesterday Pedro Domingos spoke this example
sentence.
and a trained HMM
person name
location name
background
Find the most likely state sequence (Viterbi)
Yesterday Pedro Domingos spoke this example
sentence.
Any words said to be generated by the designated
"person name" state are extracted as a person name:
Person name: Pedro Domingos
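A minimal Viterbi decode for this example, with a hand-made two-state HMM (all transition and emission numbers are assumed, not from a trained model):

```python
# Two-state HMM over the example sentence; all probabilities are toy values.
states = ["background", "person"]
start_p = {"background": 0.8, "person": 0.2}
trans_p = {
    "background": {"background": 0.8, "person": 0.2},
    "person": {"background": 0.4, "person": 0.6},
}
emit_p = {
    "background": {"Yesterday": 0.2, "spoke": 0.2},
    "person": {"Pedro": 0.4, "Domingos": 0.4},
}
DEFAULT = 0.01  # emission probability for unseen words

def viterbi(words):
    """Return the most likely state sequence for the word sequence."""
    # V[t][s] = (best probability of reaching state s at step t, best path)
    V = [{s: (start_p[s] * emit_p[s].get(words[0], DEFAULT), [s]) for s in states}]
    for w in words[1:]:
        row = {}
        for s in states:
            p, path = max((V[-1][prev][0] * trans_p[prev][s], V[-1][prev][1])
                          for prev in states)
            row[s] = (p * emit_p[s].get(w, DEFAULT), path + [s])
        V.append(row)
    return max(V[-1].values())[1]

words = ["Yesterday", "Pedro", "Domingos", "spoke"]
tags = viterbi(words)
# Words tagged "person" are extracted as the person name.
name = " ".join(w for w, t in zip(words, tags) if t == "person")
```

With these numbers the decode tags "Pedro Domingos" as the person name.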
21
HMM for Segmentation
  • Simplest Model One state per entity type

22
HMM Example Nymble
Bikel, et al 1998, BBN IdentiFinder
Task: Named Entity Extraction

States: start-of-sentence, Person, Org,
(five other name classes), Other, end-of-sentence

Transition probabilities: P(s_t | s_{t-1}, o_{t-1})
Observation probabilities: P(o_t | s_t, s_{t-1})
                        or P(o_t | s_t, o_{t-1})
Back-off to P(s_t | s_{t-1}), then P(s_t)
Back-off to P(o_t | s_t), then P(o_t)

Train on 500k words of news wire text.

Results:
Case   Language  F1
Mixed  English   93
Upper  English   91
Mixed  Spanish   90

Other examples of shrinkage for HMMs in IE:
[Freitag and McCallum 99]
23
Popular Machine Learning Methods
For details: [Feldman 2006] and [Cohen 2004]
  • Naive Bayes
  • SRV [Freitag 1998], Inductive Logic Programming
  • Rapier [Califf and Mooney 1997]
  • Hidden Markov Models [Leek 1997]
  • Maximum Entropy Markov Models [McCallum et al.
    2000]
  • Conditional Random Fields [Lafferty et al. 2001]
  • Scalability
  • Can be labor intensive to construct training data
  • At run time, complex features can be expensive to
    construct or process (batch algorithms can help
    [Chandel et al. 2006])

24
Some Available Entity Taggers
  • ABNER
  • http://www.cs.wisc.edu/bsettles/abner/
  • Linear-chain conditional random fields (CRFs)
    with orthographic and contextual features.
  • Alias-I LingPipe
  • http://www.alias-i.com/lingpipe/
  • MALLET
  • http://mallet.cs.umass.edu/index.php/Main_Page
  • Collection of NLP and ML tools; can be trained
    for named entity tagging
  • MinorThird
  • http://minorthird.sourceforge.net/
  • Tools for learning to extract entities,
    categorization, and some visualization
  • Stanford Named Entity Recognizer
  • http://nlp.stanford.edu/software/CRF-NER.shtml
  • CRF-based entity tagger with non-local features

25
Alias-I LingPipe (http://www.alias-i.com/lingpipe/)
  • Statistical named entity tagger
  • Generative statistical model
  • Find most likely tags given lexical and
    linguistic features
  • Accuracy at (or near) state of the art on
    benchmark tasks
  • Explicitly targets scalability
  • 100K tokens/second runtime on single PC
  • Pipelined extraction of entities
  • User-defined mentions, pronouns and stop list
  • Specified in a dictionary, left-to-right, longest
    match
  • Can be trained/bootstrapped on annotated corpora

26
Relation Extraction Examples
  • Extract tuples of entities that are related in a
    predefined way

Disease Outbreaks relation
Date Disease Name Location
Jan. 1995 Malaria Ethiopia
July 1995 Mad Cow Disease U.K.
Feb. 1995 Pneumonia U.S.
May 1995 Ebola Zaire
Relation Extraction
We show that CBF-A and CBF-C interact with each
other to form a CBF-A-CBF-C complex and that
CBF-B does not interact with CBF-A or CBF-C
individually but that it associates with the
CBF-A-CBF-C complex.
From AliBaba
27
Relation Extraction Approaches
  • Knowledge engineering
  • Experts develop rules, patterns
  • Can be defined over lexical items: <company>
    located in <location>
  • Or over syntactic structures: ((Obj <company>)
    (Verb located) () (Subj <location>))
  • Sophisticated development/debugging environments
  • Proteus, GATE
  • Machine learning
  • Supervised Train system over manually labeled
    data
  • Soderland et al. 1997, Muslea et al. 2000, Riloff
    et al. 1996, Roth et al 2005, Cardie et al 2006,
    Mooney et al. 2005,
  • Partially-supervised train system by
    bootstrapping from seed examples
  • Agichtein Gravano 2000, Etzioni et al., 2004,
    Yangarber Grishman 2001,
  • Open (no seeds) Sekine et al. 2006, Cafarella
    et al. 2007, Banko et al. 2007
  • Hybrid or interactive systems
  • Experts interact with machine learning algorithms
    (e.g., active learning family) to iteratively
    refine/extend rules and patterns
  • Interactions can involve annotating examples,
    modifying rules, or any combination
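A lexical pattern like "<company> located in <location>" can be sketched as a regular expression over entity-tagged text. The inline XML-style tags and the verb alternatives below are assumptions for illustration; real pattern languages (Proteus, GATE/JAPE) are richer:

```python
import re

# Hand-written lexical pattern over entity-tagged text; the tags are assumed
# to come from an upstream entity tagger.
PATTERN = re.compile(
    r"<company>(?P<company>[^<]+)</company>\s+"
    r"(?:is\s+)?(?:headquartered|located)\s+in\s+"
    r"<location>(?P<location>[^<]+)</location>"
)

def extract_located_in(tagged_text):
    """Return (company, location) tuples matched in entity-tagged text."""
    return [(m.group("company"), m.group("location"))
            for m in PATTERN.finditer(tagged_text)]
```

Such surface patterns are precise but brittle, which is exactly what motivates the supervised and bootstrapped learners listed above.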

28
Open Information Extraction Banko et al., IJCAI
2007
  • Self-Supervised Learner
  • All triples (e1, r, e2) in a sample corpus are
    considered potential tuples for relation r
  • Positive examples: candidate triples generated
    by a dependency parser
  • Train a classifier on lexical features for positive
    and negative examples
  • Single-Pass Extractor
  • Classify all pairs of candidate entities for some
    (undetermined) relation
  • Heuristically generate a relation name from the
    words between entities
  • Redundancy-Based Assessor
  • Estimate probability that entities are related
    from co-occurrence statistics
  • Scalability
  • Extraction/Indexing
  • No tuning or domain knowledge during extraction,
    relation inclusion determined at query time
  • 0.04 CPU seconds per sentence; 9M web page corpus
    in 68 CPU hours
  • Every document retrieved, processed (parsed,
    indexed, classified) in a single pass
  • Query-time
  • Distributed index for tuples by hashing on the
    relation name text
  • Related efforts: [Cucerzan and Agichtein 2005,
    Pasca et al. 2006, Sekine et al. 2006,
    Rozenfeld and Feldman 2006, ...]
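The Single-Pass Extractor's relation-naming step can be caricatured in a few lines: take the words between two candidate entities and drop stopwords. This is a rough stand-in for the paper's actual heuristics, and the stopword list is invented:

```python
# Assumed stopword list, for illustration only.
STOP = {"a", "an", "the", "was", "is", "in", "to", "of"}

def relation_name(sentence, e1, e2):
    """Heuristically name the relation between e1 and e2 from the words
    between them (a toy version of Open IE relation naming)."""
    between = sentence.split(e1, 1)[1].split(e2, 1)[0].split()
    kept = [w for w in between if w.lower() not in STOP]
    return " ".join(kept) if kept else None
```

For "Abraham Lincoln was born in Kentucky" this yields "born" as the relation name between the two entities; the redundancy-based assessor then decides, from corpus-wide co-occurrence counts, whether to trust such tuples.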

29
Event Extraction
  • Similar to Relation Extraction, but
  • Events can be nested
  • Significantly more complex (e.g., more slots)
    than relations/template elements
  • Often requires coreference resolution,
    disambiguation, deduplication, and inference
  • Example: an integrated disease outbreak event
    [Hatunnen et al. 2002]

30
Event Extraction Integration Challenges
  • Information spans multiple documents
  • Missing or incorrect values
  • Combining simple tuples into complex events
  • No single key to order or cluster likely
    duplicates while separating them from similar but
    different entities.
  • Ambiguity: distinct physical entities with the
    same name (e.g., Kennedy)
  • Duplicate entities, relation tuples extracted
  • Large lists with multiple noisy mentions of the
    same entity/tuple
  • Need to depend on fuzzy and expensive string
    similarity functions
  • Cannot afford to compare each mention with every
    other.
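A sketch of the fuzzy-matching problem above: token-set Jaccard similarity plus greedy single-pass clustering. The threshold is arbitrary, and real systems add blocking precisely because the all-pairs comparison this sketch performs is what "cannot afford" refers to:

```python
def jaccard(a, b):
    """Token-set Jaccard similarity: a cheap fuzzy match for mention strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cluster_mentions(mentions, threshold=0.5):
    """Greedy single-pass clustering: attach each mention to the first
    cluster whose representative is similar enough (threshold is assumed)."""
    clusters = []
    for m in mentions:
        for c in clusters:
            if jaccard(m, c[0]) >= threshold:
                c.append(m)
                break
        else:
            clusters.append([m])
    return clusters
```

On ["John F. Kennedy", "John Kennedy", "Edward Kennedy"] this merges the first two mentions and keeps "Edward Kennedy" separate, illustrating both deduplication and the same-surname ambiguity in one step.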

31
Accuracy of Extraction Tasks
Feldman, ICML 2006 tutorial
  • Errors cascade (errors in entity tagging cause
    errors in relation extraction)
  • This estimate is optimistic
  • Primarily for well-established (tuned) tasks
  • Many specialized or novel IE tasks (e.g. bio- and
    medical domains) exhibit lower accuracy
  • Accuracy for all tasks is significantly lower for
    non-English text

32
Multilingual Information Extraction
  • Active research area, beyond the scope of this
    talk. Nevertheless, a few (incomplete) pointers
    are provided.
  • Closely tied to machine translation and
    cross-language information retrieval efforts.
  • Language-independent named entity tagging and
    related tasks at CoNLL
  • 2006: multi-lingual dependency parsing
    (http://nextens.uvt.nl/conll/)
  • 2002, 2003 shared tasks: language-independent
    Named Entity Tagging
    (http://www.cnts.ua.ac.be/conll2003/ner/)
  • Global Autonomous Language Exploitation program
    (GALE)
  • http://www.darpa.mil/ipto/Programs/gale/concept.htm
  • Interlingual Annotation of Multilingual Text
    Corpora (IAMTC)
  • Tools and data for building MT and IE systems for
    six languages
  • http://aitc.aitcnet.org/nsf/iamtc/index.html
  • REFLEX project: NER for 50 languages
  • Exploit for training temporal correlations in
    weekly aligned corpora
  • http://l2r.cs.uiuc.edu/cogcomp/wpt.php?pr_key=REFLEX

33
Scaling Information Extraction to the Web
  • Dimensions of Scalability
  • Corpus size
  • Applying rules/patterns is expensive
  • Need efficient ways to select/filter relevant
    documents
  • Document accessibility
  • Deep web: documents only accessible via a search
    interface
  • Dynamic sources: documents disappear from the top
    page
  • Source heterogeneity
  • Coding/learning patterns for each source is
    expensive
  • Requires many rules (expensive to apply)
  • Domain diversity
  • Extracting information for any domain, entities,
    relationships

34
Scaling Up Information Extraction
  • Scan-based extraction
  • Classification/filtering to avoid processing
    documents
  • Sharing common tags/annotations
  • General keyword index-based techniques
  • QXtract, KnowItAll
  • Specialized indexes
  • BE/KnowItNow, Linguist's Search Engine
  • Parallelization/distributed processing
  • IBM WebFountain, UIMA, Google's Map/Reduce

35
Efficient Scanning for Information Extraction
Pipeline: Text Database -> (filter) -> Extraction
System -> Output Tuples
  1. Retrieve docs from database
  2. Process documents
  3. Extract output tuples
  • 80/20 rule: use a few simple rules to capture the
    majority of the instances [Pantel et al. 2004]
  • Train a classifier to discard irrelevant
    documents without processing them [Grishman et al.
    2002]
  • (e.g., the Sports section of the NYT is unlikely
    to describe disease outbreaks)
  • Share base annotations (entity tags) for multiple
    extraction tasks
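A toy version of the filtering idea above, using a keyword cue set in place of a trained classifier (the cue list is invented for illustration):

```python
# Cheap document filter run before the expensive extraction patterns:
# discard documents with no outbreak-related cue words (assumed cue list).
OUTBREAK_CUES = {"outbreak", "epidemic", "virus", "disease", "infection"}

def maybe_relevant(doc):
    """True if the document mentions any cue word; only these documents
    are passed on to the full extraction system."""
    tokens = {t.strip(".,").lower() for t in doc.split()}
    return bool(tokens & OUTBREAK_CUES)

docs = [
    "The Ebola outbreak in Zaire continued to spread.",
    "The Yankees won the pennant last night.",
]
to_process = [d for d in docs if maybe_relevant(d)]
```

The filter trades a little recall (documents that describe outbreaks without cue words) for a large reduction in documents reaching the expensive extractor.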

36
Exploiting Keyword and Phrase Indexes
  • Generate queries to retrieve only relevant
    documents
  • Data mining problem!
  • Some methods in literature
  • Traversing Query Graphs Agichtein et al. 2003
  • Iteratively refine queries Agichtein and Gravano
    2003
  • Iteratively partition document space Etzioni et
    al., 2004
  • Example systems QXtract, KnowItAll

37
Index Structures for Information Extraction
  • Bindings Engine Cafarella and Etzioni 2005
  • Indexing and querying entities K. Chakrabarti
    et al. 2006
  • IBM Avatar project
  • http://www.almaden.ibm.com/cs/projects/avatar/
  • Other indexing schemes
  • Linguist's Search Engine (P. Resnik)
    http://lse.umiacs.umd.edu:8080/
  • FREE: indexing regular expressions [Cho and
    Rajagopalan, ICDE 2002]
  • Indexing and querying linguistic information in
    XML Bird et al., 2006

38
Bindings Engine (BE) Cafarella and Etzioni 2005
  • Variabilized search query language
  • Integrates variable/type data with inverted
    index, minimizing query seeks
  • Index <NounPhrase>, <Adj-Term> terms
  • Key idea: neighbor index
  • At each position in the index, store the neighbor
    text, both lexemes and tags
  • Query: cities such as <NounPhrase>

(Figure: neighbor-index layout. For each term, the
posting list stores the document ids, the positions
within each document, and at each position the
neighboring text, both as a lexeme string and as its
tag; e.g., the posting for "as" in document 19 also
stores its right neighbor "Philadelphia" with the
tag <NounPhrase>.)
Result in document 19: "I love cities such as
Philadelphia."
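The neighbor-index lookup can be sketched in miniature; the in-memory layout below is a simplification of BE's on-disk format, built over the single example document, and the tag assignment is assumed:

```python
from collections import defaultdict

# Toy document store: each token carries an (assumed) phrase tag or None.
docs = {19: [("I", None), ("love", None), ("cities", None),
             ("such", None), ("as", None), ("Philadelphia", "NounPhrase")]}

# Neighbor index: term -> [(docid, pos, right_lexeme, right_tag)].
index = defaultdict(list)
for docid, tokens in docs.items():
    for pos, (lexeme, tag) in enumerate(tokens):
        nxt = tokens[pos + 1] if pos + 1 < len(tokens) else (None, None)
        index[lexeme.lower()].append((docid, pos, nxt[0], nxt[1]))

def query_such_as():
    """Answer 'cities such as <NounPhrase>' without re-reading the documents:
    the neighbor fields stored with 'as' already hold the answer."""
    answers = []
    for docid, pos, right_lex, right_tag in index["as"]:
        # Verify the left context "cities such" via the stored positions.
        ok = (any(d == docid and p == pos - 2 for d, p, _, _ in index["cities"])
              and any(d == docid and p == pos - 1 for d, p, _, _ in index["such"]))
        if ok and right_tag == "NounPhrase":
            answers.append(right_lex)
    return answers
```

The point of the neighbor fields is that the variable binding (here "Philadelphia") comes straight out of the index, avoiding a seek back into the document text.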
39
Parallelization/Adaptive Processing
  • Parallelize processing
  • WebFountain Gruhl et al. 2004
  • UIMA architecture
  • Map/Reduce

40
IBM WebFountain
Gruhl et al. 2004
  • Dedicated shared-nothing 256-node cluster
  • Blackboard annotation architecture
  • Data pipelined and streamed past each augmenter
    to add annotations
  • Merge and index annotations
  • Index both tokens and annotations
  • Between 25K-75K entities per second

41
UIMA (IBM Research)
  • Unstructured Information Management Architecture
    (UIMA)
  • http://www.research.ibm.com/UIMA/
  • Open component software architecture for
    development, composition, and deployment of text
    processing and analysis components.
  • The run-time framework allows components and
    applications to be plugged in and run on different
    platforms. Supports distributed processing,
    failure recovery, ...
  • Scales to millions of documents; incorporated
    into IBM OmniFind; grid computing-ready
  • The UIMA SDK (freely available) includes a
    run-time framework, APIs, and tools for composing
    and deploying UIMA components.
  • Framework source code also available on
    Sourceforge
  • http://uima-framework.sourceforge.net/

42
Map/Reduce (Dean & Ghemawat, OSDI 2004)
43
Map/Reduce (continued)
  • General framework
  • Scales to 1000s of machines
  • Recently implemented in Nutch and other open
    source efforts
  • Maps nicely to information extraction
  • Map phase
  • Parse individual documents
  • Tag entities
  • Propose candidate relation tuples
  • Reduce phase
  • Merge multiple mentions of same relation tuple
  • Resolve co-references, duplicates
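The map/reduce mapping described above, in miniature: the "tagger" is just a toy regex standing in for a real parser and entity tagger, and the reduce step merges duplicate mentions of the same relation tuple:

```python
import re
from collections import defaultdict

# Toy pattern standing in for the per-document tagger/parser (an assumption).
PATTERN = re.compile(r"(\w[\w ]*?) is the CEO of (\w[\w ]*)")

def map_doc(doc):
    """Map phase: parse one document and propose candidate relation tuples."""
    return [(("CEO",) + m.groups(), 1) for m in PATTERN.finditer(doc)]

def reduce_tuples(mapped):
    """Reduce phase: merge multiple mentions of the same relation tuple."""
    counts = defaultdict(int)
    for key, n in mapped:
        counts[key] += n
    return dict(counts)

docs = [
    "Jeffrey Immelt is the CEO of General Electric.",
    "Jeffrey Immelt is the CEO of General Electric, the company said.",
]
mapped = [pair for d in docs for pair in map_doc(d)]
tuples = reduce_tuples(mapped)
```

Because map runs per document and reduce groups by tuple key, the same code shape scales out across machines exactly as the bullet list describes.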

44
References
  • Tutorials
  • Eugene Agichtein, Towards Web-Scale Information
    Extraction, KDD 2007
    http://www.mathcs.emory.edu/eugene/kdd-webinar/
  • R. Feldman, Information Extraction: Theory and
    Practice, ICML 2006
    http://www.cs.biu.ac.il/feldman/icml_tutorial.html
  • W. Cohen, A. McCallum, Information Extraction and
    Integration: an Overview, KDD 2003
    http://www.cs.cmu.edu/wcohen/ie-survey.ppt

45
What Should You Know
  • Information extraction is key to converting
    unstructured data to structured data
  • Basic tasks in information extraction (entities,
    relations, events)
  • Basic ideas of some of the methods