Lecture 16: Information Extraction - PowerPoint PPT Presentation

1
Lecture 16: Information Extraction
Oct. 26, 2007, ChengXiang Zhai
Most slides are from Eugene Agichtein's and
William Cohen's tutorials
2
The Value of Text Data
  • Unstructured text data is the primary form of
    human-generated information
  • Blogs, web pages, news, scientific literature,
    online reviews,
  • Semi-structured data (database generated); see
    Prof. Bing Liu's KDD webinar:
    http://www.cs.uic.edu/liub/WCM-Refs.html
  • The techniques discussed here are complementary
    to structured object extraction methods
  • Need to extract structured information to
    effectively manage, search, and mine the data
  • Information Extraction: a mature but active
    research area
  • Intersection of Computational Linguistics,
    Machine Learning, Data mining, Databases, and
    Information Retrieval
  • Traditional focus on accuracy of extraction
  • Recently attention paid to scalability

3
Example Answering Queries Over Text
For years, Microsoft Corporation CEO Bill Gates
was against open source. But today he appears to
have changed his mind. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered, saying ...
SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'

PEOPLE
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..

Result: Bill Gates, Bill Veghte
(from William Cohen's IE tutorial, 2003)
4
IE History Pre-Web
  • Mostly news articles
  • De Jong's FRUMP [1982]
  • Hand-built system to fill Schank-style scripts
    from news wire
  • Message Understanding Conference (MUC), DARPA
    '87-'95; TIPSTER '92-'96
  • Early work dominated by hand-built models
  • E.g. SRI's FASTUS, hand-built FSMs
  • But by the 1990s, some machine learning [Lehnert,
    Cardie, Grishman] and then HMMs [Elkan, Leek 97;
    BBN, Bikel et al 98]

5
IE History Web
  • AAAI '94 Spring Symposium on Software Agents
  • Much discussion of ML applied to the Web: Maes,
    Mitchell, Etzioni
  • Tom Mitchell's WebKB, '96
  • Build KBs from the Web
  • Wrapper Induction
  • Initially hand-built, then ML [Soderland 96,
    Kushmerick 97, ...]
  • Citeseer, Cora, FlipDog, contEd courses,
    corpInfo, ...
  • WebFountain (IBM)
  • KnowItAll (University of Washington)

6
IE History Other Domains
  • Biology
  • Gene/protein entity extraction
  • Protein/protein interaction facts
  • Automated curation/integration of databases
  • At CMU: SLIF (Murphy et al; subcellular
    information from images and text in journal
    articles)
  • At UIUC: BeeSpace (http://www.beespace.uiuc.edu/)
  • Email
  • EPCA, PAL, RADAR, CALO: intelligent office
    assistants that understand some part of email
  • At CMU: web site update requests, office-space
    requests, calendar scheduling requests, social
    network analysis of email

7
Landscape of IE Tasks (1/4): Degree of Formatting
Text paragraphs without formatting
Grammatical sentences and some formatting & links
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
Non-grammatical snippets, rich formatting & links
Tables
8
Landscape of IE Tasks (2/4): Intended Breadth of
Coverage
Web site specific
Genre specific
Wide, non-specific
Formatting
Layout
Language
Amazon.com Book Pages
Resumes
University Names
9
Landscape of IE Tasks (3/4): Complexity
E.g. word patterns
Regular set
Closed set
U.S. phone numbers
U.S. states
Phone (413) 545-1323
He was born in Alabama
The CALD main office can be reached at
412-268-1299
The big Wyoming sky
Ambiguous patterns, needing context and many
sources of evidence
Complex pattern
U.S. postal addresses
Person names
University of Arkansas P.O. Box 140 Hope, AR
71802
was among the six houses sold by Hope Feldman
that year.
Pawel Opalinski, Software Engineer at WhizBang
Labs.
Headquarters: 1128 Main Street, 4th
Floor, Cincinnati, Ohio 45210
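The complexity tiers above map directly onto implementation effort: a closed set needs only a dictionary lookup, a regular set a regular expression. A minimal sketch of both (the state list is truncated and the phone regex simplified, for illustration only):

```python
import re

# Closed set: U.S. states can be matched by dictionary lookup (truncated list).
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}

# Regular set: U.S. phone numbers fit a (simplified) regular expression.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[ -]?\d{3}-\d{4}\b")

def find_phones(text):
    """Return all phone-number-like spans in the text."""
    return PHONE_RE.findall(text)

def find_states(text):
    """Return capitalized tokens that appear in the closed set."""
    return [w for w in re.findall(r"[A-Z][a-z]+", text) if w in US_STATES]
```

The ambiguous-pattern tier (person names, postal addresses) is exactly where such single-source matchers break down and context features become necessary.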
10
Landscape of IE Tasks (4/4): Single Field/Record
Jack Welch will retire as CEO of General Electric
tomorrow. The top role at the Connecticut
company will be filled by Jeffrey Immelt.
Single entity:
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut
Binary relationship:
  Relation: Person-Title
    Person: Jack Welch, Title: CEO
  Relation: Company-Location
    Company: General Electric, Location: Connecticut
N-ary record:
  Relation: Succession
    Company: General Electric, Title: CEO,
    Out: Jack Welch, In: Jeffrey Immelt
Named entity extraction
11
Landscape of IE Techniques (1/1): Models
Lexicons
Abraham Lincoln was born in Kentucky.
member?
Alabama Alaska Wisconsin Wyoming
Any of these models can be used to capture words,
formatting or both.
12
Hand-Coded Methods
  • Easy to construct in some cases
  • e.g., to recognize prices, phone numbers, zip
    codes, conference names, etc.
  • Intuitive to debug and maintain
  • Especially if written in a high-level language
  • Can incorporate domain knowledge
  • Scalability issues
  • Labor-intensive to create
  • Highly domain-specific
  • Often corpus-specific
  • Rule-matches can be expensive

IBM Avatar
13
Machine Learning Methods
  • Can work well when lots of training data are
    easy to construct
  • Can capture complex patterns that are hard to
    encode with hand-crafted rules
  • e.g., determine whether a review is positive or
    negative
  • extract long complex gene names
  • Non-local dependencies

14
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer
Science Carnegie Mellon University
330 pm 7500 Wean
Hall Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during the
1980s and 1990s. As a result of its success and
growth, machine learning is evolving into a
collection of related disciplines inductive
concept acquisition, analytic learning in problem
solving (e.g. analogy, explanation-based
learning), learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
15-17
Extraction by Sliding Window
(Slides 15-17 repeat the announcement text above,
with the sliding window advancing across it.)
18
A Naïve Bayes Sliding Window Model
Freitag 1997
00 pm  Place: Wean Hall Rm 5409
Speaker: Sebastian Thrun

prefix:   w_{t-m} ... w_{t-1}
contents: w_t ... w_{t+n}
suffix:   w_{t+n+1} ... w_{t+n+m}

Estimate Pr(LOCATION | window) using Bayes
rule. Try all reasonable windows (vary length,
position). Assume independence for length, prefix
words, suffix words, and content words. Estimate
from data quantities like Pr("Place" in
prefix | LOCATION).
If Pr("Wean Hall Rm 5409" = LOCATION) is above
some threshold, extract it.
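The scoring step above can be sketched as follows. All probabilities are made-up toy numbers standing in for quantities estimated from data, and the flat background model is a simplification added so the comparison has something to beat:

```python
import math

# Toy estimates (assumed, for illustration; not Freitag's trained values).
prior = {"LOCATION": 0.05, "OTHER": 0.95}
prefix_probs = {"Place": 0.5, "Speaker": 0.01}   # P(word in prefix | LOCATION)
content_probs = {"Wean": 0.3, "Hall": 0.3, "Rm": 0.2, "5409": 0.1}
DEFAULT = 1e-4   # smoothing for words unseen under LOCATION
BG = 0.01        # flat per-word probability under the OTHER model (assumed)

def log_score(label, prefix, contents):
    """Naive-Bayes log-probability of the window under a label, assuming
    prefix words and content words are independent given the label."""
    if label == "LOCATION":
        s = math.log(prior[label])
        s += sum(math.log(prefix_probs.get(w, DEFAULT)) for w in prefix)
        s += sum(math.log(content_probs.get(w, DEFAULT)) for w in contents)
        return s
    return math.log(prior[label]) + (len(prefix) + len(contents)) * math.log(BG)

def extract(prefix, contents):
    """Extract the window as a LOCATION if it outscores the background."""
    return log_score("LOCATION", prefix, contents) > log_score("OTHER", prefix, contents)
```

With these numbers, the window following "Place" scores above background while an unrelated window does not; a real system would also fold in the length and suffix terms.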
19
BWI Learning to detect boundaries
Freitag & Kushmerick, AAAI 2000
  • Another formulation learn three probabilistic
    classifiers
  • START(i) = Prob(position i starts a field)
  • END(j) = Prob(position j ends a field)
  • LEN(k) = Prob(an extracted field has length k)
  • Then score a possible extraction (i,j) by
  • START(i) × END(j) × LEN(j-i)
  • LEN(k) is estimated from a histogram
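A toy rendering of the scoring formula, with hypothetical probability tables standing in for the learned START/END classifiers and the length histogram:

```python
# Hypothetical classifier outputs for a speaker-name field in this token
# sequence (all numbers assumed, for illustration).
tokens = ["Speaker", ":", "Sebastian", "Thrun", "Place", ":", "Wean"]
START = {2: 0.9, 4: 0.1}        # START(i): position i starts the field
END = {4: 0.8, 5: 0.2}          # END(j): field ends just before position j
LEN = {1: 0.2, 2: 0.6, 3: 0.2}  # histogram of observed field lengths

def score(i, j):
    """BWI score of extracting tokens[i:j]: START(i) * END(j) * LEN(j - i)."""
    return START.get(i, 0.0) * END.get(j, 0.0) * LEN.get(j - i, 0.0)

# Pick the highest-scoring candidate span.
best = max(((i, j) for i in START for j in END if j > i),
           key=lambda ij: score(*ij))
```

Here the best span is tokens[2:4], i.e. "Sebastian Thrun".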

20
IE with Hidden Markov Models
Given a sequence of observations
Yesterday Pedro Domingos spoke this example
sentence.
and a trained HMM
person name
location name
background
Find the most likely state sequence (Viterbi)
Yesterday Pedro Domingos spoke this example
sentence.
Any words said to be generated by the designated
"person name" state are extracted as a person name:
Person name: Pedro Domingos
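A minimal Viterbi decode for this example, with a hand-made two-state HMM (all transition and emission numbers are assumed, not from a trained model):

```python
# Two-state HMM over the example sentence; all probabilities are toy values.
states = ["background", "person"]
start_p = {"background": 0.8, "person": 0.2}
trans_p = {
    "background": {"background": 0.8, "person": 0.2},
    "person": {"background": 0.4, "person": 0.6},
}
emit_p = {
    "background": {"Yesterday": 0.2, "spoke": 0.2},
    "person": {"Pedro": 0.4, "Domingos": 0.4},
}
DEFAULT = 0.01  # emission probability for unseen words

def viterbi(words):
    """Return the most likely state sequence for the word sequence."""
    # V[t][s] = (best probability of reaching state s at step t, best path)
    V = [{s: (start_p[s] * emit_p[s].get(words[0], DEFAULT), [s]) for s in states}]
    for w in words[1:]:
        row = {}
        for s in states:
            p, path = max((V[-1][prev][0] * trans_p[prev][s], V[-1][prev][1])
                          for prev in states)
            row[s] = (p * emit_p[s].get(w, DEFAULT), path + [s])
        V.append(row)
    return max(V[-1].values())[1]

words = ["Yesterday", "Pedro", "Domingos", "spoke"]
tags = viterbi(words)
# Words tagged "person" are extracted as the person name.
name = " ".join(w for w, t in zip(words, tags) if t == "person")
```

With these numbers the decode tags "Pedro Domingos" as the person name.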
21
HMM for Segmentation
  • Simplest Model One state per entity type

22
HMM Example Nymble
Bikel, et al 1998, BBN IdentiFinder
Task: Named Entity Extraction

States: start-of-sentence, Person, Org,
(five other name classes), Other, end-of-sentence

Transition probabilities: P(s_t | s_{t-1}, o_{t-1})
Observation probabilities: P(o_t | s_t, s_{t-1})
                        or P(o_t | s_t, o_{t-1})
Back-off to P(s_t | s_{t-1}), then P(s_t)
Back-off to P(o_t | s_t), then P(o_t)

Train on 500k words of news wire text.

Results:
Case   Language  F1
Mixed  English   93
Upper  English   91
Mixed  Spanish   90

Other examples of shrinkage for HMMs in IE:
[Freitag and McCallum 99]
23
Popular Machine Learning Methods
For details: [Feldman 2006] and [Cohen 2004]
  • Naive Bayes
  • SRV [Freitag 1998], Inductive Logic Programming
  • Rapier [Califf and Mooney 1997]
  • Hidden Markov Models [Leek 1997]
  • Maximum Entropy Markov Models [McCallum et al.
    2000]
  • Conditional Random Fields [Lafferty et al. 2001]
  • Scalability
  • Can be labor intensive to construct training data
  • At run time, complex features can be expensive to
    construct or process (batch algorithms can help
    [Chandel et al. 2006])

24
Some Available Entity Taggers
  • ABNER
  • http://www.cs.wisc.edu/bsettles/abner/
  • Linear-chain conditional random fields (CRFs)
    with orthographic and contextual features.
  • Alias-I LingPipe
  • http://www.alias-i.com/lingpipe/
  • MALLET
  • http://mallet.cs.umass.edu/index.php/Main_Page
  • Collection of NLP and ML tools; can be trained
    for named entity tagging
  • MinorThird
  • http://minorthird.sourceforge.net/
  • Tools for learning to extract entities,
    categorization, and some visualization
  • Stanford Named Entity Recognizer
  • http://nlp.stanford.edu/software/CRF-NER.shtml
  • CRF-based entity tagger with non-local features

25
Alias-I LingPipe (http://www.alias-i.com/lingpipe/)
  • Statistical named entity tagger
  • Generative statistical model
  • Find most likely tags given lexical and
    linguistic features
  • Accuracy at (or near) state of the art on
    benchmark tasks
  • Explicitly targets scalability
  • 100K tokens/second runtime on single PC
  • Pipelined extraction of entities
  • User-defined mentions, pronouns and stop list
  • Specified in a dictionary, left-to-right, longest
    match
  • Can be trained/bootstrapped on annotated corpora

26
Relation Extraction Examples
  • Extract tuples of entities that are related in a
    predefined way

Disease Outbreaks relation
Date Disease Name Location
Jan. 1995 Malaria Ethiopia
July 1995 Mad Cow Disease U.K.
Feb. 1995 Pneumonia U.S.
May 1995 Ebola Zaire
Relation Extraction
We show that CBF-A and CBF-C interact with each
other to form a CBF-A-CBF-C complex and that
CBF-B does not interact with CBF-A or CBF-C
individually but that it associates with the
CBF-A-CBF-C complex.
From AliBaba
27
Relation Extraction Approaches
  • Knowledge engineering
  • Experts develop rules, patterns
  • Can be defined over lexical items: <company>
    located in <location>
  • Or over syntactic structures: ((Obj <company>)
    (Verb located) () (Subj <location>))
  • Sophisticated development/debugging environments
  • Proteus, GATE
  • Machine learning
  • Supervised Train system over manually labeled
    data
  • Soderland et al. 1997, Muslea et al. 2000, Riloff
    et al. 1996, Roth et al 2005, Cardie et al 2006,
    Mooney et al. 2005,
  • Partially-supervised train system by
    bootstrapping from seed examples
  • Agichtein Gravano 2000, Etzioni et al., 2004,
    Yangarber Grishman 2001,
  • Open (no seeds) Sekine et al. 2006, Cafarella
    et al. 2007, Banko et al. 2007
  • Hybrid or interactive systems
  • Experts interact with machine learning algorithms
    (e.g., active learning family) to iteratively
    refine/extend rules and patterns
  • Interactions can involve annotating examples,
    modifying rules, or any combination
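A lexical pattern like "<company> located in <location>" can be sketched as a regular expression over entity-tagged text. The inline XML-style tags and the verb alternatives below are assumptions for illustration; real pattern languages (Proteus, GATE/JAPE) are richer:

```python
import re

# Hand-written lexical pattern over entity-tagged text; the tags are assumed
# to come from an upstream entity tagger.
PATTERN = re.compile(
    r"<company>(?P<company>[^<]+)</company>\s+"
    r"(?:is\s+)?(?:headquartered|located)\s+in\s+"
    r"<location>(?P<location>[^<]+)</location>"
)

def extract_located_in(tagged_text):
    """Return (company, location) tuples matched in entity-tagged text."""
    return [(m.group("company"), m.group("location"))
            for m in PATTERN.finditer(tagged_text)]
```

Such surface patterns are precise but brittle, which is exactly what motivates the supervised and bootstrapped learners listed above.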

28
Open Information Extraction Banko et al., IJCAI
2007
  • Self-Supervised Learner
  • All triples (e1, r, e2) in a sample corpus are
    considered potential tuples for relation r
  • Positive examples: candidate triples generated
    by a dependency parser
  • Train a classifier on lexical features for positive
    and negative examples
  • Single-Pass Extractor
  • Classify all pairs of candidate entities for some
    (undetermined) relation
  • Heuristically generate a relation name from the
    words between entities
  • Redundancy-Based Assessor
  • Estimate probability that entities are related
    from co-occurrence statistics
  • Scalability
  • Extraction/Indexing
  • No tuning or domain knowledge during extraction,
    relation inclusion determined at query time
  • 0.04 CPU seconds per sentence; 9M web page corpus
    in 68 CPU hours
  • Every document retrieved, processed (parsed,
    indexed, classified) in a single pass
  • Query-time
  • Distributed index for tuples by hashing on the
    relation name text
  • Related efforts: [Cucerzan and Agichtein 2005,
    Pasca et al. 2006, Sekine et al. 2006,
    Rozenfeld and Feldman 2006, ...]
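The Single-Pass Extractor's relation-naming step can be caricatured in a few lines: take the words between two candidate entities and drop stopwords. This is a rough stand-in for the paper's actual heuristics, and the stopword list is invented:

```python
# Assumed stopword list, for illustration only.
STOP = {"a", "an", "the", "was", "is", "in", "to", "of"}

def relation_name(sentence, e1, e2):
    """Heuristically name the relation between e1 and e2 from the words
    between them (a toy version of Open IE relation naming)."""
    between = sentence.split(e1, 1)[1].split(e2, 1)[0].split()
    kept = [w for w in between if w.lower() not in STOP]
    return " ".join(kept) if kept else None
```

For "Abraham Lincoln was born in Kentucky" this yields "born" as the relation name between the two entities; the redundancy-based assessor then decides, from corpus-wide co-occurrence counts, whether to trust such tuples.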

29
Event Extraction
  • Similar to Relation Extraction, but
  • Events can be nested
  • Significantly more complex (e.g., more slots)
    than relations/template elements
  • Often requires coreference resolution,
    disambiguation, deduplication, and inference
  • Example: an integrated disease outbreak event
    [Hatunnen et al. 2002]

30
Event Extraction Integration Challenges
  • Information spans multiple documents
  • Missing or incorrect values
  • Combining simple tuples into complex events
  • No single key to order or cluster likely
    duplicates while separating them from similar but
    different entities.
  • Ambiguity: distinct physical entities with the
    same name (e.g., Kennedy)
  • Duplicate entities, relation tuples extracted
  • Large lists with multiple noisy mentions of the
    same entity/tuple
  • Need to depend on fuzzy and expensive string
    similarity functions
  • Cannot afford to compare each mention with every
    other.
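A sketch of the fuzzy-matching problem above: token-set Jaccard similarity plus greedy single-pass clustering. The threshold is arbitrary, and real systems add blocking precisely because the all-pairs comparison this sketch performs is what "cannot afford" refers to:

```python
def jaccard(a, b):
    """Token-set Jaccard similarity: a cheap fuzzy match for mention strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cluster_mentions(mentions, threshold=0.5):
    """Greedy single-pass clustering: attach each mention to the first
    cluster whose representative is similar enough (threshold is assumed)."""
    clusters = []
    for m in mentions:
        for c in clusters:
            if jaccard(m, c[0]) >= threshold:
                c.append(m)
                break
        else:
            clusters.append([m])
    return clusters
```

On ["John F. Kennedy", "John Kennedy", "Edward Kennedy"] this merges the first two mentions and keeps "Edward Kennedy" separate, illustrating both deduplication and the same-surname ambiguity in one step.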

31
Accuracy of Extraction Tasks
Feldman, ICML 2006 tutorial
  • Errors cascade (errors in entity tagging cause
    errors in relation extraction)
  • This estimate is optimistic
  • Primarily for well-established (tuned) tasks
  • Many specialized or novel IE tasks (e.g. bio- and
    medical domains) exhibit lower accuracy
  • Accuracy for all tasks is significantly lower for
    non-English text

32
Multilingual Information Extraction
  • Active research area, beyond the scope of this
    talk. Nevertheless, a few (incomplete) pointers
    are provided.
  • Closely tied to machine translation and
    cross-language information retrieval efforts.
  • Language-independent named entity tagging and
    related tasks at CoNLL
  • 2006: multi-lingual dependency parsing
    (http://nextens.uvt.nl/conll/)
  • 2002, 2003 shared tasks: language-independent
    Named Entity Tagging
    (http://www.cnts.ua.ac.be/conll2003/ner/)
  • Global Autonomous Language Exploitation program
    (GALE)
  • http://www.darpa.mil/ipto/Programs/gale/concept.htm
  • Interlingual Annotation of Multilingual Text
    Corpora (IAMTC)
  • Tools and data for building MT and IE systems for
    six languages
  • http://aitc.aitcnet.org/nsf/iamtc/index.html
  • REFLEX project: NER for 50 languages
  • Exploit for training temporal correlations in
    weekly aligned corpora
  • http://l2r.cs.uiuc.edu/cogcomp/wpt.php?pr_key=REFLEX

33
Scaling Information Extraction to the Web
  • Dimensions of Scalability
  • Corpus size
  • Applying rules/patterns is expensive
  • Need efficient ways to select/filter relevant
    documents
  • Document accessibility
  • Deep web: documents only accessible via a search
    interface
  • Dynamic sources: documents disappear from the top
    page
  • Source heterogeneity
  • Coding/learning patterns for each source is
    expensive
  • Requires many rules (expensive to apply)
  • Domain diversity
  • Extracting information for any domain, entities,
    relationships

34
Scaling Up Information Extraction
  • Scan-based extraction
  • Classification/filtering to avoid processing
    documents
  • Sharing common tags/annotations
  • General keyword index-based techniques
  • QXtract, KnowItAll
  • Specialized indexes
  • BE/KnowItNow, Linguist's Search Engine
  • Parallelization/distributed processing
  • IBM WebFountain, UIMA, Google's Map/Reduce

35
Efficient Scanning for Information Extraction
Pipeline: Text Database -> (filter) -> Extraction
System -> Output Tuples
  1. Retrieve docs from database
  2. Process documents
  3. Extract output tuples
  • 80/20 rule: use a few simple rules to capture the
    majority of the instances [Pantel et al. 2004]
  • Train a classifier to discard irrelevant
    documents without processing them [Grishman et al.
    2002]
  • (e.g., the Sports section of the NYT is unlikely
    to describe disease outbreaks)
  • Share base annotations (entity tags) for multiple
    extraction tasks
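A toy version of the filtering idea above, using a keyword cue set in place of a trained classifier (the cue list is invented for illustration):

```python
# Cheap document filter run before the expensive extraction patterns:
# discard documents with no outbreak-related cue words (assumed cue list).
OUTBREAK_CUES = {"outbreak", "epidemic", "virus", "disease", "infection"}

def maybe_relevant(doc):
    """True if the document mentions any cue word; only these documents
    are passed on to the full extraction system."""
    tokens = {t.strip(".,").lower() for t in doc.split()}
    return bool(tokens & OUTBREAK_CUES)

docs = [
    "The Ebola outbreak in Zaire continued to spread.",
    "The Yankees won the pennant last night.",
]
to_process = [d for d in docs if maybe_relevant(d)]
```

The filter trades a little recall (documents that describe outbreaks without cue words) for a large reduction in documents reaching the expensive extractor.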

36
Exploiting Keyword and Phrase Indexes
  • Generate queries to retrieve only relevant
    documents
  • Data mining problem!
  • Some methods in literature
  • Traversing Query Graphs Agichtein et al. 2003
  • Iteratively refine queries Agichtein and Gravano
    2003
  • Iteratively partition document space Etzioni et
    al., 2004
  • Example systems QXtract, KnowItAll

37
Index Structures for Information Extraction
  • Bindings Engine Cafarella and Etzioni 2005
  • Indexing and querying entities K. Chakrabarti
    et al. 2006
  • IBM Avatar project
  • http://www.almaden.ibm.com/cs/projects/avatar/
  • Other indexing schemes
  • Linguist's Search Engine (P. Resnik)
    http://lse.umiacs.umd.edu:8080/
  • FREE: indexing regular expressions [Cho and
    Rajagopalan, ICDE 2002]
  • Indexing and querying linguistic information in
    XML Bird et al., 2006

38
Bindings Engine (BE) Cafarella and Etzioni 2005
  • Variabilized search query language
  • Integrates variable/type data with inverted
    index, minimizing query seeks
  • Index <NounPhrase>, <Adj-Term> terms
  • Key idea: neighbor index
  • At each position in the index, store the neighbor
    text, both lexemes and tags
  • Query: cities such as <NounPhrase>

(Figure: neighbor-index layout. For each term, the
posting list stores the document ids, the positions
within each document, and at each position the
neighboring text, both as a lexeme string and as its
tag; e.g., the posting for "as" in document 19 also
stores its right neighbor "Philadelphia" with the
tag <NounPhrase>.)
Result in document 19: "I love cities such as
Philadelphia."
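The neighbor-index lookup can be sketched in miniature; the in-memory layout below is a simplification of BE's on-disk format, built over the single example document, and the tag assignment is assumed:

```python
from collections import defaultdict

# Toy document store: each token carries an (assumed) phrase tag or None.
docs = {19: [("I", None), ("love", None), ("cities", None),
             ("such", None), ("as", None), ("Philadelphia", "NounPhrase")]}

# Neighbor index: term -> [(docid, pos, right_lexeme, right_tag)].
index = defaultdict(list)
for docid, tokens in docs.items():
    for pos, (lexeme, tag) in enumerate(tokens):
        nxt = tokens[pos + 1] if pos + 1 < len(tokens) else (None, None)
        index[lexeme.lower()].append((docid, pos, nxt[0], nxt[1]))

def query_such_as():
    """Answer 'cities such as <NounPhrase>' without re-reading the documents:
    the neighbor fields stored with 'as' already hold the answer."""
    answers = []
    for docid, pos, right_lex, right_tag in index["as"]:
        # Verify the left context "cities such" via the stored positions.
        ok = (any(d == docid and p == pos - 2 for d, p, _, _ in index["cities"])
              and any(d == docid and p == pos - 1 for d, p, _, _ in index["such"]))
        if ok and right_tag == "NounPhrase":
            answers.append(right_lex)
    return answers
```

The point of the neighbor fields is that the variable binding (here "Philadelphia") comes straight out of the index, avoiding a seek back into the document text.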
39
Parallelization/Adaptive Processing
  • Parallelize processing
  • WebFountain Gruhl et al. 2004
  • UIMA architecture
  • Map/Reduce

40
IBM WebFountain
Gruhl et al. 2004
  • Dedicated shared-nothing 256-node cluster
  • Blackboard annotation architecture
  • Data pipelined and streamed past each augmenter
    to add annotations
  • Merge and index annotations
  • Index both tokens and annotations
  • Between 25K-75K entities per second

41
UIMA (IBM Research)
  • Unstructured Information Management Architecture
    (UIMA)
  • http://www.research.ibm.com/UIMA/
  • Open component software architecture for
    development, composition, and deployment of text
    processing and analysis components.
  • The run-time framework allows components and
    applications to be plugged in and run on different
    platforms. Supports distributed processing,
    failure recovery, ...
  • Scales to millions of documents; incorporated
    into IBM OmniFind; grid computing-ready
  • The UIMA SDK (freely available) includes a
    run-time framework, APIs, and tools for composing
    and deploying UIMA components.
  • Framework source code also available on
    Sourceforge
  • http://uima-framework.sourceforge.net/

42
Map/Reduce (Dean & Ghemawat, OSDI 2004)
43
Map/Reduce (continued)
  • General framework
  • Scales to 1000s of machines
  • Recently implemented in Nutch and other open
    source efforts
  • Maps nicely to information extraction
  • Map phase
  • Parse individual documents
  • Tag entities
  • Propose candidate relation tuples
  • Reduce phase
  • Merge multiple mentions of same relation tuple
  • Resolve co-references, duplicates
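The map/reduce mapping described above, in miniature: the "tagger" is just a toy regex standing in for a real parser and entity tagger, and the reduce step merges duplicate mentions of the same relation tuple:

```python
import re
from collections import defaultdict

# Toy pattern standing in for the per-document tagger/parser (an assumption).
PATTERN = re.compile(r"(\w[\w ]*?) is the CEO of (\w[\w ]*)")

def map_doc(doc):
    """Map phase: parse one document and propose candidate relation tuples."""
    return [(("CEO",) + m.groups(), 1) for m in PATTERN.finditer(doc)]

def reduce_tuples(mapped):
    """Reduce phase: merge multiple mentions of the same relation tuple."""
    counts = defaultdict(int)
    for key, n in mapped:
        counts[key] += n
    return dict(counts)

docs = [
    "Jeffrey Immelt is the CEO of General Electric.",
    "Jeffrey Immelt is the CEO of General Electric, the company said.",
]
mapped = [pair for d in docs for pair in map_doc(d)]
tuples = reduce_tuples(mapped)
```

Because map runs per document and reduce groups by tuple key, the same code shape scales out across machines exactly as the bullet list describes.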

44
References
  • Tutorials
  • Eugene Agichtein, Towards Web-Scale Information
    Extraction, KDD 2007
    http://www.mathcs.emory.edu/eugene/kdd-webinar/
  • R. Feldman, Information Extraction: Theory and
    Practice, ICML 2006
    http://www.cs.biu.ac.il/feldman/icml_tutorial.html
  • W. Cohen, A. McCallum, Information Extraction and
    Integration: an Overview, KDD 2003
    http://www.cs.cmu.edu/wcohen/ie-survey.ppt

45
What Should You Know
  • Information extraction is key to converting
    unstructured data to structured data
  • Basic tasks in information extraction (entities,
    relations, events)
  • Basic ideas of some of the methods