CSE 635 Multimedia Information Retrieval - PowerPoint PPT Presentation

About This Presentation
Title:

CSE 635 Multimedia Information Retrieval

Description:

CSE 635 Multimedia Information Retrieval Information Extraction – PowerPoint PPT presentation

Slides: 37
Provided by: Rohi6
Transcript and Presenter's Notes

Title: CSE 635 Multimedia Information Retrieval


1
CSE 635 Multimedia Information Retrieval
  • Information Extraction

2
Overview
  • Introduction to IE
  • Named Entity tagger
  • HMM approach
  • Relationship/Event detection
  • Text Mining
  • Intelligence applications

3
Information Extraction
  • What is IE
  • The identification of instances of a particular
    class of events or relationships in a natural
    language text, and the extraction of the relevant
    arguments of the event or relationship. (MUC, de
    facto)
  • Information Extraction involves the creation of a
    structured representation (such as a database) of
    selected information drawn from the text.
    (Grishman 1997)
  • identification of key entities, relationships
    between them, and significant activity involving
    these entities (Srihari)
  • Goals of IE
  • transform unstructured text into
    structured/semi-structured text
  • automatic template-filling
  • automatically populate databases
  • facilitate information discovery
  • sometimes what you don't know is most important:
    if you know what you are looking for, use a
    search engine! IE permits information discovery

4
Information to Intelligence
[Diagram: Unstructured Data -> Information (entities, relationships, events about People, Company, Product) -> Intelligence (text mining, analytics)]
Example unstructured data (headlines):
  • Ronald Brumback Named Pres. COO of Top Layer Networks
  • INTC drops X
  • Top INTC executive, John Doe, leaves to join Transmeta as VP Engineering
  • RF Micro Devices Introduces Cellular CDMA LNA and PA Driver Amplifier with Bypass Switch
  • Microsoft, Lockheed eye federal deals
  • C-bridge, eXcelon to merge
  • Transmeta Scores Latest Crusoe Win with Sharp
  • FedEx to Cut 130 Jobs in Texas
Example intelligence questions:
  • What caused INTC shares to drop?
  • What's new from RFMD?

5
Levels of Information Extraction
  • MUC identifies the following levels of
    extraction
  • Named Entity Tagging
  • Bill Gates is the chairman of Microsoft
  • Relationship Detection leads to entity profiles
  • chairman-of(Bill Gates, Microsoft)
  • Event Detection
  • executive change
  • person_in, person_out
  • company_involved
  • date
  • Scenario Extraction
  • Bombing incident
  • where
  • # of casualties
  • reason
  • follow-up
  • events involved ordered sequentially

6
Named Entity Tagging
Bridgestone Sports Co. said Friday it has set
up a joint venture in Hong Kong with a local
concern and a Japanese trading house to produce
golf clubs to be shipped to Japan. The joint
venture, Bridgestone Sports Hong Kong Co.,
capitalized at 20 million Hong Kong dollars, will
start production in January 1990 with production
of 20,000 iron and "metal wood" clubs a month.
The monthly output will be later raised to 50,000
units, Bridgestone Sports spokesman Tom White
said. The new company, based in Kaohsiung,
southern Hong Kong, is owned 75 pct by
Bridgestone Sports, 15 pct by Union Precision
Casting Co. of Hong Kong and the remainder by
Taga Co., a company active in trading with Hong
Kong, the officials said.
7
Output of Named Entity Tagger
<company> Bridgestone Sports Co. </company>
said <date> Friday </date> it has set up a joint
venture in <city> Hong Kong </city> with a local
concern and a <ethnic> Japanese </ethnic>
trading house to produce golf clubs to be
shipped to <country> Japan </country>. The
joint venture, <company> Bridgestone Sports Hong
Kong Co. </company>, capitalized at <money> 20
million Hong Kong dollars </money>, will start
production in <date> January 1990 </date> with
production of 20,000 iron and "metal wood" clubs
a month. The monthly output will be later raised
to 50,000 units, <company> Bridgestone Sports
</company> spokesman <man> Tom White </man>
said.
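
Inline tags such as <company> ... </company> are easy to turn into a structured list. The sketch below is one hedged way to do that with a regular expression; the function name `extract_entities` and the shortened sample string are illustrative assumptions, not part of the original tagger.

```python
import re

# Matches <class> text </class>; the backreference \1 requires the
# closing tag name to equal the opening tag name.
TAG_RE = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>", re.DOTALL)


def extract_entities(tagged_text):
    """Return (name_class, entity_text) pairs in document order."""
    return [
        (m.group(1), " ".join(m.group(2).split()))  # collapse line breaks
        for m in TAG_RE.finditer(tagged_text)
    ]


# Shortened excerpt of the tagged sentence (hypothetical test input).
sample = (
    "<company> Bridgestone Sports Co. </company> said <date> Friday </date> "
    "it has set up a joint venture in <city>Hong Kong </city>"
)
entities = extract_entities(sample)
```

Here `entities` holds one pair per tagged span, which is the kind of structured record that later stages (question answering, relationship extraction) can consume.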
8
Named-Entity Definition
  • A named entity is a word or phrase that denotes a
    proper name (person, organization, location,
    product) or a temporal or numerical
    expression.
  • Name classes are associated with individual
    words.
  • A named-entity is associated with a contiguous
    word sequence with the same name class.
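
As a minimal sketch of the last point, per-word name classes can be grouped into entities by merging contiguous runs with the same class. The function name `group_entities` is hypothetical, and the class codes (NN for "not name", LO for location) are borrowed from the name-class slides.

```python
from itertools import groupby


def group_entities(words, classes, not_name="NN"):
    """Merge contiguous words sharing a name class into one entity.

    Returns (name_class, phrase) pairs; words tagged `not_name` are skipped.
    """
    entities = []
    for name_class, run in groupby(zip(words, classes), key=lambda pair: pair[1]):
        phrase = " ".join(word for word, _ in run)
        if name_class != not_name:
            entities.append((name_class, phrase))
    return entities


words = "it has set up a joint venture in Hong Kong".split()
classes = ["NN"] * 8 + ["LO", "LO"]
locations = group_entities(words, classes)
```

One limitation of this simple grouping: two different entities of the same class that happen to be adjacent would be merged into one.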

9
Entity Profiles
<Person Profile id=1>
name: Waleed Alshehri
aliases: Waleed
position: a Saudi commercial pilot
age: mid-20s
gender: MALE
education: Embry-Riddle Aeronautical University; FlightSafety Academy
associations: Satam Al Suqami; Wail Alshehri; Homing Inn; American Flight 11
events involved: <graduated>, <hijacking>, <suicide attack>
descriptors: quiet and private; Middle Eastern backgrounds; another of the eventual hijackers
10
Event Detection
  • Event: <MOVEMENT>
  • who: 23 foreign fighters
  • whereto: into Pakistan
  • Location: Pakistan, Afghanistan
  • When: normalStr=020622, Monday
  • Snippet: Pakistan said Monday its troops arrested 23
    foreign fighters trying to cross from
    Afghanistan into Pakistan over the weekend.
  • Event: <CONTRACT>
  • Money_involved: 5.9 million (8.9 million)
  • Who: CVF Team, Thomson-CSF, Lockheed
    Martin, Raytheon, BMT Defense Services, Defense
    Procurement Agency
  • When: normalStr=021100, last November
  • Snippet: The BAE Systems-led CVF Team and a rival
    Thomson-CSF group, including
    Lockheed Martin, Raytheon and BMT Defense
    Services, were awarded parallel 5.9
    million (8.9 million) contracts by the Defense
    Procurement Agency last November
    to undertake first-stage assessment phase work
    for CVF.

11
3 Major Approaches to IE
  • Layout-based
  • wrapper induction
  • application focused e.g. jobs database,
    processing resumes, etc.
  • IR-based
  • concept extraction
  • uses techniques such as pattern matching,
    proximity, co-occurrence
  • often seen in Knowledge Management applications
    (e.g. hardware)
  • NLP-based
  • statistical techniques (POS tagging, NE tagging)
  • grammatical techniques
  • more sophisticated levels of IE possible

12
Convergence of NLP-driven and IR-driven
Approaches to IE
[Figure labels: Focus on Recall; Focus on Precision]
13
Challenges in IE
  • Normalization
  • temporal references (today, last year, during the
    Olympics)
  • spatial references (Buffalo)
  • Alias resolution
  • George Bush, President Bush
  • IBM, the company
  • Verb concepts
  • kill, murder, assassinate, etc.
  • Diversity of sources
  • web documents, e-mail, powerpoint, speech/OCR
    transcripts
  • sophisticated pre-processing required
  • Cross-document information consolidation
  • Rapid domain porting
  • Intuitive user interface
  • should support decision making
  • work flow, visualization, etc.

14
Homeland Defense Track Key Entities Based on
Watch Lists
15
Name-Class Definition
OR  organization
    CO   company:  Bridgestone Sports Co., Bridgestone Sports Hong Kong Co., Bridgestone Sports
LO  location
    CI   city:     Hong Kong
    CT   country:  Japan
PE  person
    MAN  man:      Tom White
TI  time
    DA   date:     Friday
NN  not name:      said, it has set up a joint venture, with a local concern and a, trading house to produce golf
16
Name-Class Tree
There are 6 top-level name-classes, and 35
sub-type name-classes.
Time -- Hour, Part Day, Duration, Frequency, Age, Day, Month, Season, Year, Decade, Century
Location -- City, Province, Country, Continent, Ocean, Lake, River, Mountain, Road, Region, District, Airport
Organization -- Company, Government, Association, School, Army, Mass Media
Person -- Man, Woman
Product -- Vehicle, Software
Event -- Conference, Exhibition
17
Application of Named Entity Tagging
  • Question-Answering System
  • Q: Where did Bridgestone Sports Co. set up a
    joint venture?
  • A: Hong Kong
  • Q: When did Bridgestone Sports Hong Kong Co.
    start production?
  • A: January 1990
  • Q: Who is the spokesman for Bridgestone Sports?
  • A: Tom White

18
Question Asking Points and Named Entities
Where? -> Location
  Q: Where did Bridgestone Sports Co. set up a joint venture?
  A: Hong Kong
When? -> Time
  Q: When did Bridgestone Sports Hong Kong Co. start production?
  A: January 1990
Who? -> Person
  Q: Who is the spokesman for Bridgestone Sports?
  A: Tom White
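
The mapping from asking point to name class can be sketched as a lookup followed by a filter over tagged entities. This is only an illustration under simplifying assumptions: the names `ASKING_POINT`, `asking_point` and `candidate_answers` are hypothetical, and a real system would also have to match the question to the right sentence rather than filter by class alone.

```python
# Asking point (question word) -> expected top-level name class.
ASKING_POINT = {"where": "location", "when": "time", "who": "person"}


def asking_point(question):
    """Return the name class implied by the question word, or None."""
    first_word = question.strip().split()[0].lower().rstrip("?")
    return ASKING_POINT.get(first_word)


def candidate_answers(question, entities):
    """Keep only the entities whose name class matches the asking point."""
    target = asking_point(question)
    return [text for text, name_class in entities if name_class == target]


# Entities as a named entity tagger might return them for the example text.
tagged = [
    ("Hong Kong", "location"),
    ("January 1990", "time"),
    ("Tom White", "person"),
]
answers = candidate_answers("Who is the spokesman for Bridgestone Sports?", tagged)
```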
19
Application of Named Entity Tagging (contd.)
Support other Information Extraction tasks
Extract Correlated Entities (relationship)
  entity 1: Tom White (man)
  relation: employed by
  entity 2: Bridgestone Sports (company)
Extract events
  predicate: start
  argument 1: Bridgestone Sports Hong Kong Co. (company)
  argument 2: production
  time: January 1990 (date)
20
Other Applications of NE
  • Search engines
  • text categorization/filtering
  • data mining

21
Statistical Model for Named Entity Tagging
  • Given a sequence of words (W), our goal is to
    find the sequence of name-class (NC) with maximum
    Pr(NC | W).
  • For example
  • word sequence
  • it has set up a joint venture in Hong Kong
  • Possible name-class sequences
      it  has  set  up  a   joint  venture  in  Hong  Kong
      NN  NN   NN   NN  NN  NN     NN       NN  LO    LO
      LO  NN   NN   NN  NN  NN     NN       NN  OR    LO

22
Statistical Model for Named Entity Tagging
(contd.)
  • Construct a manually tagged training corpus.
  • Extract necessary statistics from the corpus to
    build a statistical model which can automatically
    compute Pr(NC Sequence | W Sequence) for unseen
    data.
  • Search for the NC sequence which maximizes the
    probability Pr(NC Sequence | W Sequence)

[Diagram: Corpus -> Statistical Model; unseen data -> Statistical Model -> tagging]
23
Statistical Model for Named Entity Tagging
(contd.)
  • The size of the training corpus is large enough
    to provide fairly good unigram and bigram
    information.
  • unigram example: Pr(Organization | US)
  • bigram example: Pr(Organization | US, the)
  • The size of the training corpus is too small to
    support any direct evaluation beyond bigram.
  • Question: how to evaluate Pr(NC Sequence |
    Sentence) based on the above unigram and bigram
    information?
  • One solution: transform the conditional
    probability into the (NC, Sentence) joint probability
    (Bayes rule)
  • Decompose the sentence into bigram sequences (Markov
    assumption)

24
Bayes Rule

Using Bayes rule, we have
25
Markov Assumption
By Markov assumption, we have
26
Markov Assumption (contd.)
So the final formula is
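
The three formulas introduced on slides 24 to 26 are not present in this transcript. A reconstruction consistent with the preceding slide (Bayes rule, then a bigram Markov assumption) and with the transition distribution p(symbol_n, state_n | symbol_n-1, state_n-1) on the Hidden Markov Model slide might read as follows. The symbols w_i and nc_i, and the convention that (w_0, nc_0) is the sentence start, are assumptions made here for illustration.

```latex
% Bayes rule: Pr(W) does not depend on NC, so it can be dropped from the argmax.
\Pr(NC \mid W) = \frac{\Pr(W, NC)}{\Pr(W)}
\quad\Longrightarrow\quad
\arg\max_{NC} \Pr(NC \mid W) = \arg\max_{NC} \Pr(W, NC)

% Markov assumption: each (word, name-class) pair depends only on the previous pair.
\Pr(W, NC) \approx \prod_{i=1}^{n} \Pr(w_i, nc_i \mid w_{i-1}, nc_{i-1})

% Final formula: the tagger searches for the best name-class sequence.
NC^{*} = \arg\max_{nc_1 \ldots nc_n} \prod_{i=1}^{n} \Pr(w_i, nc_i \mid w_{i-1}, nc_{i-1})
```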
27
Hidden Markov Model
  • Define Hidden Markov Model as follows
  • An output alphabet {0, 1, ..., V-1}
  • A state space {1, 2, ..., c}
  • A transition probability distribution between
    states and associated output symbols: p(symbol_n,
    state_n | symbol_n-1, state_n-1).
  • In the case of named entity tagging, regard words as
    output symbols and tags as states. The
    above statistical NE model is a Hidden Markov
    Model.
            W1   W2   W3   W4   ..
    <SS>    PE   PE   PE   PE
            LO   LO   LO   LO
            OR   OR   OR   OR

28
Statistics Estimation
The generation of words and name-class proceeds
in three steps
The Maximum Likelihood Estimates (MLE) of the above
probabilities are as follows
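
The three generation steps and their estimation formulas are not in this transcript. As a hedged stand-in, the sketch below estimates the bigram quantity Pr(w_i, nc_i | w_i-1, nc_i-1) by relative frequency, which is the usual form of an MLE: the count of the pair sequence divided by the count of the context. The function name `mle_bigram`, the start symbol and the tiny corpus are assumptions for illustration, and no smoothing or back-off is applied.

```python
from collections import Counter

# Hypothetical sentence-start pair, standing in for the <SS> state.
START = ("<s>", "<SS>")


def mle_bigram(tagged_sentences):
    """Relative-frequency estimate of Pr(cur | prev) over (word, class) pairs.

    tagged_sentences: list of sentences, each a list of (word, name_class).
    Returns a dict mapping (prev_pair, cur_pair) to a probability.
    """
    pair_counts = Counter()
    context_counts = Counter()
    for sentence in tagged_sentences:
        prev = START
        for cur in sentence:
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
            prev = cur
    return {
        (prev, cur): count / context_counts[prev]
        for (prev, cur), count in pair_counts.items()
    }


# Toy manually tagged corpus (NN = not name, LO = location).
corpus = [
    [("in", "NN"), ("Hong", "LO"), ("Kong", "LO")],
    [("in", "NN"), ("Japan", "LO")],
]
probs = mle_bigram(corpus)
```

Pairs never seen in training get no entry at all, which is why a practical model needs to fall back on the unigram statistics mentioned earlier.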
29
Easy and Difficult Cases
  • Some cases are easy
  • Matsushita Electric Industrial Co. has reached
    agreement
  • Victor Co. of Japan (JVC) and Sony Corp. ...
  • Some cases are particularly difficult
  • In a factory of Blaupunkt Werke, a Robert Bosch
    subsidiary,
  • Touch Panel Systems, capitalized at 50 million
    Yen is owned ...

30
Machine learning vs. handcrafted rules
  • Handcrafted finite state patterns can be very
    effective
  • <proper-noun> <corporate designator> -->
    <corporation>
  • e.g. Sony Corp.
  • Problems with handcrafted approach
  • each new source requires tweaking, i.e. domain
    porting can be tedious
  • speech recognition transcript, OCR require
    modification of rules
  • rules for different languages are radically
    different
  • Machine learning approach is more scalable
  • exception: numerical expressions and other patterns
    which are very regular, e.g. contact information
  • telephone numbers, URLs, postal addresses, etc.
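
The handcrafted pattern "<proper-noun> <corporate designator> --> <corporation>" can be approximated with a regular expression. This is a rough sketch, not the rule set of any actual system: the designator list and the capitalized-word test for proper nouns are assumptions, and the name `find_corporations` is hypothetical.

```python
import re

# Assumed, deliberately short list of corporate designators.
DESIGNATORS = r"(?:Corp\.|Co\.|Inc\.|Ltd\.)"

# One or more capitalized words followed by a corporate designator.
CORP_RE = re.compile(rf"((?:[A-Z][\w&-]*\s+)+{DESIGNATORS})")


def find_corporations(text):
    """Return substrings matching <proper-noun>+ <corporate designator>."""
    return CORP_RE.findall(text)


easy = find_corporations("Matsushita Electric Industrial Co. has reached agreement with Sony Corp.")
hard = find_corporations("Touch Panel Systems, capitalized at 50 million Yen is owned")
```

The easy case yields both company names, while the difficult case yields nothing because "Touch Panel Systems" carries no designator. That gap is the kind of brittleness the slide attributes to handcrafted rules.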

31
NE tagger- Bikel et al
  • PDF file

32
Viterbi Search
The Viterbi search algorithm is used to find the NC
sequence which maximizes the probability
Pr(NC Sequence | W Sequence).
        W1   W2   W3   W4   ..
<SS>    PE   PE   PE   PE
        LO   LO   LO   LO
        OR   OR   OR   OR
The best paths reaching the nodes associated
with W1 are self-evident. Three paths reach the node
(W2, PE): (PE, PE, -1.0), (LO, PE, -1.5),
(OR, PE, -0.95). The best path reaching (W2, PE) is
(OR, PE, -0.95). Compute the best paths reaching the
nodes associated with W2, keep only the best path
reaching each node, and continue the same computation
for the next word.
[Figure: scores shown on the trellis: -0.2, -0.8, -1.2, -0.3, -0.9, -0.05]
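
A compact sketch of this search, assuming log-scores that add along a path. The pairing of the trellis scores is inferred from the three path totals (-0.2 + -0.8 = -1.0 for PE, -0.3 + -1.2 = -1.5 for LO, -0.05 + -0.9 = -0.95 for OR); which number sits on which edge, the default score of -5.0 for every other transition, and the names `viterbi` and `toy_score` are all assumptions made for illustration.

```python
START = "<SS>"  # sentence-start state, as in the trellis


def viterbi(words, states, log_score):
    """Return (best_total, best_state_path) for the word sequence.

    log_score(prev_state, prev_word, state, word) is a log-probability-like
    score for moving to (word, state) from (prev_word, prev_state).
    """
    # For each state: (score of best path ending there, the path itself).
    best = {s: (log_score(START, None, s, words[0]), [s]) for s in states}
    for i in range(1, len(words)):
        new_best = {}
        for s in states:
            candidates = [
                (total + log_score(p, words[i - 1], s, words[i]), path + [s])
                for p, (total, path) in best.items()
            ]
            # Keep only the best path reaching node (words[i], s).
            new_best[s] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])


# Toy scores: only the six values recoverable from the slide are listed.
TABLE = {
    (START, "PE"): -0.2, (START, "LO"): -0.3, (START, "OR"): -0.05,
    ("PE", "PE"): -0.8, ("LO", "PE"): -1.2, ("OR", "PE"): -0.9,
}


def toy_score(prev_state, prev_word, state, word):
    # Words are ignored in this toy table; unseen transitions get -5.0.
    return TABLE.get((prev_state, state), -5.0)


total, path = viterbi(["W1", "W2"], ["PE", "LO", "OR"], toy_score)
```

With these toy scores the best path into (W2, PE) comes from OR, matching the slide. Because only one path per node is kept, the work grows linearly with sentence length rather than exponentially with the number of candidate sequences.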
33
What next?
  • We know how to tag NEs locally. What next?
  • Alias resolution
  • George W. Bush, President Bush, Bush
  • Relationship extraction
  • affiliation
  • spouse
  • address
  • Event Detection
  • Entity Profiles
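
To make the alias resolution step concrete, here is a deliberately naive sketch that links person mentions sharing a last name to the longest mention. It is a hypothetical heuristic, not the method used in the course system; the title list and the name `resolve_aliases` are assumptions.

```python
# Assumed list of titles to ignore when comparing mentions.
TITLES = {"President", "Mr.", "Ms.", "Dr."}


def name_tokens(mention):
    """Mention tokens without titles (falls back to all tokens if none remain)."""
    tokens = [t for t in mention.split() if t not in TITLES]
    return tokens or mention.split()


def resolve_aliases(mentions):
    """Map each mention to the longest title-free mention with the same last token."""
    canonical = {}
    for mention in mentions:
        tokens = name_tokens(mention)
        key = tokens[-1]
        if key not in canonical or len(tokens) > len(name_tokens(canonical[key])):
            canonical[key] = mention
    return {m: canonical[name_tokens(m)[-1]] for m in mentions}


links = resolve_aliases(["George W. Bush", "President Bush", "Bush"])
```

A last-name match like this would wrongly merge different people who share a surname, so a real resolver needs document context and further evidence.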

34
Extracting relationships and events
  • Two major approaches
  • grammatical
  • statistical
  • Grammatical approaches
  • require SVO parsing, semantic parsing as a first
    step
  • followed by specialized relationship and event
    extraction grammars
  • Two approaches here also
  • one behemoth grammar (CFG)
  • cascaded, finite state grammars
  • Statistical approaches
  • supervised learning approach
  • unsupervised approach using extraction patterns

35
Architecture of InfoXtract Engine/Platform
[Architecture diagram. Labels recoverable from the figure:]
Input/output and control: Source Document, Document Processor, Tokenizer, Zoned Text Document, Token List, Process Manager, Server, Web, HTTP Post, HTTP Response, CORBA, Output Manager, XML Formatted Extracted Document, Document Error Log
Linguistic Modules (Natural Language Processing): Lexicon Lookup, POS Tagging, Named Entity Detection, Time/location Normalization, Number Normalization, Shallow Parsing, Semantic Parsing, Relationship Detection, Alias/Coreference Linking, Pragmatic Filtering, Profile/Event Merge, Profile/Event Linking
Knowledge Resources: Lexicon Resources, Grammars
Legend (module types): FST Module, Procedure or Statistical Model, Hybrid Module
Abbreviations: NE Named Entity; CE Correlated Entity; SVO Subject-Verb-Object; CO Co-reference; GE General Event; PE Pre-defined Event; POS Part Of Speech; FST Finite State Transducer
36
Adapting FSTs for NLP engines
  • Traditionally, FSTs have operated on character
    streams, both input and output
  • primarily used in lexical transducers
  • InfoXtract tokenizer converts the input stream into
    a tokenlist; all subsequent modules operate on
    the tokenlist
  • tokenlist contains the following information
  • linguistic features (POS, semantic class from
    WordNet etc.)
  • linguistic structures derived from NLP (e.g.,
    SVO)
  • information extraction output: NE, relationships,
    events
  • pointers to tokens (text offsets)
  • real objects (text strings) as well as virtual
    objects
  • FST grammars operate on tokenlists and can
    utilize features at several levels
  • character/string level, structure level
  • equivalent to tree-walking automata
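
A token list of this kind can be pictured as a list of records carrying text offsets plus features added by later modules. The sketch below is an illustration only and does not reflect InfoXtract's actual data structures: the `Token` fields, `tokenize` and `match_pattern` are hypothetical, and a fixed-length predicate sequence stands in for a real finite state grammar.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str                  # string-level information
    start: int                 # text offsets back into the source
    end: int
    pos: Optional[str] = None  # feature filled in by a POS tagger
    ne: Optional[str] = None   # feature filled in by an NE tagger


def tokenize(text):
    """Whitespace tokenizer that records character offsets."""
    tokens, offset = [], 0
    for word in text.split():
        start = text.index(word, offset)
        end = start + len(word)
        tokens.append(Token(word, start, end))
        offset = end
    return tokens


def match_pattern(tokens, predicates):
    """Return (begin, end) index ranges where consecutive tokens satisfy
    the predicates in order; each predicate may test any token feature."""
    n = len(predicates)
    return [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if all(p(t) for p, t in zip(predicates, tokens[i:i + n]))
    ]


tokens = tokenize("Sony Corp. said Friday")
tokens[0].pos = "NNP"  # pretend an earlier module tagged this token

# Mixes a feature-level test (POS) with a string-level test (designator).
hits = match_pattern(
    tokens,
    [lambda t: t.pos == "NNP", lambda t: t.text in {"Corp.", "Co."}],
)
```

Because every module reads and writes the same token list, a later pattern can combine string-level and feature-level tests in one rule, which is the point the slide makes about grammars using features at several levels.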