Robust Semantics, Information Extraction, and Information Retrieval

Author: Julia Hirschberg. Last modified by: Julia Hirschberg. Created: 8/7/2002 3:01:55 PM. Presentation format: On-screen Show.


1
  • Robust Semantics, Information Extraction, and
    Information Retrieval

2
Problems with Syntax-Driven Semantics
  • Syntactic structures often don't fit semantic
    structures very well
  • Important semantic elements are often distributed
    very differently in trees for sentences that mean
    the same thing
  • I like soup. Soup is what I like.
  • Parse trees contain many structural elements not
    clearly important to making semantic distinctions
  • Syntax-driven semantic representations are
    sometimes pretty verbose
  • V --> serves

3
Alternatives?
  • Semantic Grammars
  • Information Extraction Techniques
  • Information Retrieval --> Information Extraction

4
Semantic Grammars
  • Alternative to modifying syntactic grammars to
    deal with semantics too
  • Define grammars specifically in terms of the
    semantic information we want to extract
  • Domain specific: rules correspond directly to
    entities and activities in the domain
  • I want to go from Boston to Baltimore on
    Thursday, September 24th
  • Greeting --> Hello | Hi | Um
  • TripRequest --> Need-spec travel-verb from City to
    City on Date
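
A rule such as TripRequest can be approximated with a regular expression whose named groups act as the semantic slots. The sketch below is a minimal illustration, not the lecture's implementation: the word lists for City, Need-spec, and travel-verb, the simplified Date pattern, and the function name parse_trip_request are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny lexicons; a real grammar would be far larger.
CITY = r"(?:Boston|Baltimore|Denver)"
NEED_SPEC = r"(?:I want|I need|I would like)"
TRAVEL_VERB = r"(?:to go|to fly|to travel)"
# Simplified date: a weekday, optionally followed by ", Month DDth".
DATE = (r"(?:(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"
        r"(?:, \w+ \d{1,2}(?:st|nd|rd|th)?)?)")

# TripRequest --> Need-spec travel-verb from City to City on Date
TRIP_REQUEST = re.compile(
    rf"{NEED_SPEC} {TRAVEL_VERB} from (?P<from_city>{CITY}) "
    rf"to (?P<to_city>{CITY}) on (?P<date>{DATE})"
)


def parse_trip_request(utterance):
    """Return the filled slots as a dict, or None if the rule does not match."""
    match = TRIP_REQUEST.search(utterance)
    return match.groupdict() if match else None


print(parse_trip_request(
    "I want to go from Boston to Baltimore on Thursday, September 24th"))
print(parse_trip_request("I want to leave from my house."))
```

The second call returns None: the utterance falls outside the grammar, so no slots are filled.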

5
Predicting User Input
  • Semantic grammars rely upon knowledge of the task
    and (sometimes) constraints on what the user can
    do, when
  • Allows them to handle very sophisticated
    phenomena
  • I want to go to Boston on Thursday.
  • I want to leave from there on Friday for
    Baltimore.
  • TripRequest --> Need-spec travel-verb from City on
    Date for City
  • Dialogue postulate maps filler for from-city to
    pre-specified from-city
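
One way to picture the dialogue postulate is as a rule over slot frames. The sketch below assumes a reading of the example in which "there" in the second request refers to the city named in the first request (Boston). The frame layout, the slot names, and the function resolve_from_city are hypothetical.

```python
def resolve_from_city(frame, previous_frame):
    """Replace an anaphoric from-city ("there") with the previously named city."""
    resolved = dict(frame)  # copy so the caller's frame is left untouched
    if resolved.get("from_city") == "there" and previous_frame:
        # Assumption: the earlier request's destination is the antecedent.
        resolved["from_city"] = previous_frame.get("to_city")
    return resolved


# "I want to go to Boston on Thursday."
first = {"to_city": "Boston", "date": "Thursday"}
# "I want to leave from there on Friday for Baltimore."
second = {"from_city": "there", "to_city": "Baltimore", "date": "Friday"}

print(resolve_from_city(second, first))
```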

6
Drawbacks of Semantic Grammars
  • Lack of generality
  • A new one for each application
  • Large cost in development time
  • Can be very large, depending on how much coverage
    you want
  • If users go outside the grammar, things may break
    disastrously
  • I want to leave from my house.
  • I want to talk to someone human.

7
Information Extraction
  • Another robust alternative
  • Idea is to extract particular types of
    information from arbitrary text or transcribed
    speech
  • Examples
  • Named entities: people, places, organizations,
    times, dates
  • Telephone numbers
  • <Organization>MIPS</Organization> Vice President
    <Person>John Hime</Person>
  • Domains: medical texts, broadcast news,
    voicemail, ...

8
Appropriate where Semantic Grammars and
Syntactic Parsers are Not
  • Appropriate where information needs are very specific
  • Question answering systems, gisting of news or
    mail
  • Job ads, financial information, terrorist attacks
  • Input too complex and far-ranging to build
    semantic grammars
  • But full-blown syntactic parsers are impractical
  • Too much ambiguity for arbitrary text
  • 50 parses or none at all
  • Too slow for real-time applications

9
Information Extraction Techniques
  • Often use a set of simple templates or frames
    with slots to be filled in from input text
  • Ignore everything else
  • My number is 212-555-1212.
  • The inventor of the wiggleswort was Capt. John T.
    Hart.
  • The king died in March of 1932.
  • Context (neighboring words, capitalization,
    punctuation) provides cues to help fill in the
    appropriate slots
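
The template idea can be sketched with one pattern per slot, each relying on surface cues such as digits, capitalization, and punctuation. This is an illustration under assumptions, not the lecture's system: the slot names, the title list, and the patterns themselves are invented for the three example sentences above.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# One hypothetical pattern per slot; everything else in the text is ignored.
PATTERNS = {
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    # A title followed by capitalized words or single-letter initials.
    "person": re.compile(r"\b(?:Capt|Dr|Mr|Mrs|Ms)\.(?: [A-Z][a-z]+| [A-Z]\.)+"),
    "date": re.compile(rf"\b(?:{MONTHS})(?: of)? \d{{4}}\b"),
}


def fill_template(text):
    """Return a dict with one entry per slot (None when no cue is found)."""
    template = {}
    for slot, pattern in PATTERNS.items():
        match = pattern.search(text)
        template[slot] = match.group(0) if match else None
    return template


for sentence in (
    "My number is 212-555-1212.",
    "The inventor of the wiggleswort was Capt. John T. Hart.",
    "The king died in March of 1932.",
):
    print(fill_template(sentence))
```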

10
The IE Process
  • Given a corpus and a target set of items to be
    extracted
  • Clean up the corpus
  • Tokenize it
  • Do some hand labeling of target items
  • Extract some simple features
  • POS tags
  • Phrase Chunks
  • Do some machine learning to associate features
    with target items, or derive this association by
    intuition
  • Use e.g. FSTs, simple or cascaded, to iteratively
    annotate the input, eventually identifying the
    slot fillers
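
A cascade can be sketched in two stages: the first annotates each token with a simple feature tag, and the second runs a finite-state pattern over the tag sequence rather than over the raw text. The sketch below uses a regular expression as a stand-in for a finite-state transducer. The tag set, the title list, and the function names are assumptions, and the association between features and the target item is hand-written here (the "intuition" route) rather than learned.

```python
import re

TITLES = {"Capt.", "Dr.", "Mr.", "Mrs.", "Ms."}  # hypothetical title list


def tag_token(token):
    """Stage 1: map each token to a one-letter feature tag."""
    if token in TITLES:
        return "T"  # title
    if re.fullmatch(r"[A-Z]\.", token):
        return "I"  # initial such as "T."
    if token[:1].isupper():
        return "C"  # capitalized word
    return "o"      # anything else


def find_people(text):
    """Stage 2: match TITLE (CAPITALIZED | INITIAL)+ over the tag string."""
    tokens = text.rstrip(".").split()  # crude tokenization for the sketch
    tags = "".join(tag_token(t) for t in tokens)  # one character per token
    return [" ".join(tokens[m.start():m.end()])
            for m in re.finditer(r"T[CI]+", tags)]


print(find_people("The inventor of the wiggleswort was Capt. John T. Hart."))
```

Because each token maps to exactly one tag character, match offsets in the tag string index directly into the token list.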

11
Some examples
  • Semantic grammars
  • Information extraction

12
Information Retrieval
  • How related to NLP?
  • Operates on language (speech or text)
  • Does it use linguistic information?
  • Stemming
  • Bag-of-words approach
  • Does it make use of document formatting?
  • Headlines, punctuation, captions
  • Collection: a set of documents
  • Term: a word or phrase
  • Query: a set of terms

13
But what is a term?
  • Stop list
  • Stemming
  • Homonymy, polysemy, synonymy
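
A rough sketch of turning text into terms with a stop list and stemming follows. The stop list is a tiny illustrative sample, and the suffix stripper is a crude stand-in written for this example, not the Porter stemmer or any published algorithm. Homonymy, polysemy, and synonymy are not addressed at all.

```python
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "to", "and"}  # illustrative only
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """Lowercase, strip punctuation, drop stop words, then stem."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return [crude_stem(w) for w in words if w and w not in STOP_WORDS]


print(extract_terms("Retrieving documents and indexed terms."))
```

With this normalization, "documents" and "document" count as the same term, which is the point of stemming; the cost is occasional non-words such as "retriev".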

14
Vector Space Model
  • Simple versions represent documents and queries
    as feature vectors, one binary feature for each
    term in collection
  • Is t in this document or query or not?
  • D = (t1, t2, ..., tn)
  • Q = (t1, t2, ..., tn)
  • Similarity metric: how many terms does a query
    share with each candidate document?
  • Weighted terms: term-by-document matrix
  • D = (wt1, wt2, ..., wtn)
  • Q = (wt1, wt2, ..., wtn)
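
The binary version of the model can be sketched in a few lines: one 0/1 feature per vocabulary term, with similarity counted as the number of shared terms. The vocabulary, the example terms, and the function names below are made up for illustration.

```python
def binary_vector(terms, vocabulary):
    """One binary feature per vocabulary term: is t in this document or query?"""
    present = set(terms)
    return [1 if t in present else 0 for t in vocabulary]


def shared_terms(doc_vec, query_vec):
    """Number of terms the query shares with the document (binary dot product)."""
    return sum(d * q for d, q in zip(doc_vec, query_vec))


vocabulary = ["boston", "baltimore", "soup", "trip"]
D = binary_vector(["trip", "boston", "baltimore"], vocabulary)
Q = binary_vector(["boston", "soup"], vocabulary)
print(D, Q, shared_terms(D, Q))
```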

15
  • How do we compare the vectors?
  • Normalize each term weight by the number of terms
    in the document: how important is each t in D?
  • Compute dot product between vectors to see how
    similar they are
  • Cosine of angle: 1 = identity, 0 = no common
    terms
  • How do we get the weights?
  • Term frequency (tf): how often does t occur in D?
  • Inverse document frequency (idf): # docs / # docs
    term t occurs in
  • tf.idf weighting: weight of term i for doc j is
    product of frequency of i in j with log of idf in
    collection
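
The weighting and comparison steps can be sketched as follows, under a few assumptions: tf is taken as the raw count of the term in the document, idf as the number of documents divided by the number containing the term, and length normalization is left to the cosine rather than applied to each weight beforehand. The function names and the toy collection are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of term lists. Returns (vocabulary, one weight vector per doc)."""
    vocabulary = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: in how many docs does each term occur?
    doc_freq = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # term frequency within this doc
        vectors.append([counts[t] * math.log(n_docs / doc_freq[t])
                        for t in vocabulary])
    return vocabulary, vectors


def cosine(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = [["soup", "soup", "like"], ["like", "trip"]]
vocabulary, vectors = tf_idf_vectors(docs)
print(vocabulary)
print(vectors)
print(cosine(vectors[0], vectors[1]))
```

In this toy collection "like" occurs in every document, so its idf is 1 and its log is 0: a term that appears everywhere carries no weight.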

16
Evaluating IR Performance
  • Precision: # rel docs returned / # total docs
    returned -- how often are you right when you say
    this document is relevant?
  • Recall: # rel docs returned / # rel docs in
    collection -- how many of the relevant documents
    do you find?
  • F-measure: combines P and R
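
The three measures can be computed from a set of returned documents and a set of relevant ones. The slide does not give a formula for the F-measure; the sketch below assumes the commonly used weighted harmonic mean, with beta = 1 weighting precision and recall equally. The function name and document ids are hypothetical.

```python
def precision_recall_f(returned, relevant, beta=1.0):
    """Return (precision, recall, F) for a set of returned and relevant doc ids."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant docs that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


p, r, f = precision_recall_f({"d1", "d2", "d3", "d4"}, {"d1", "d3", "d5"})
print(p, r, f)
```

Here two of the four returned documents are relevant (precision 1/2) and two of the three relevant documents were found (recall 2/3), which gives F = 4/7.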

17
Improving Queries
  • Relevance feedback: users rate retrieved docs
  • Query expansion: many techniques
  • e.g. add top N docs retrieved to query
  • Term clustering: cluster rows of terms to produce
    synonyms and add to query
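
One simple reading of "add top N docs retrieved to query" is to take the most frequent new terms from the top-ranked documents and append them to the query. The sketch below takes that reading as an assumption; the function name, the parameters top_n and extra_terms, and the toy documents are hypothetical.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_n=2, extra_terms=3):
    """Append the most frequent unseen terms from the top_n ranked docs."""
    seen = set(query_terms)
    counts = Counter(t for doc in ranked_docs[:top_n] for t in doc
                     if t not in seen)
    return list(query_terms) + [t for t, _ in counts.most_common(extra_terms)]


ranked_docs = [
    ["soup", "recipe", "tomato", "recipe"],
    ["soup", "recipe", "broth"],
    ["car"],  # ranked third, so ignored when top_n=2
]
print(expand_query(["soup"], ranked_docs, top_n=2, extra_terms=1))
```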

18
IR Tasks
  • Ad hoc retrieval: normal IR
  • Routing/categorization: assign new doc to one of
    predefined set of categories
  • Clustering: divide a collection into N clusters
  • Segmentation: segment text into coherent chunks
  • Summarization: compress a text by extracting
    summary items
  • Question-answering: find a stretch of text
    containing the answer to a question

19
Summary
  • Many approaches to robust semantic analysis
  • Semantic grammars targeting particular domains
  • Utterance --> Yes/No Reply
  • Yes/No Reply --> Yes-Reply | No-Reply
  • Yes-Reply --> yes, yeah, right, ok, you bet, ...
  • Information extraction techniques targeting
    specific tasks
  • Extracting information about terrorist events
    from news
  • Information retrieval techniques --> more like NLP