Information Extraction and Automatic Summarisation

Provided by: osirisSun

Transcript and Presenter's Notes


1
Information Extraction and Automatic Summarisation

2
How IE fits in with IR
  • IR selects a few relevant documents from many
  • IE starts with one or a few relevant documents
  • IE pulls out the words and phrases most central
    to the meaning of that/those documents to produce
    an extract.

3
Two processes associated with information extraction
  • Determination of facts to go into structured
    fields in a database.
  • Extraction of text that can be used to summarise
    an item.
  • In the first case, only a subset of the important
    facts in an item may be identified and extracted.
    The term "slot" is used to define a particular
    category of information to be extracted. Slots
    are organised into semantic frames.

4
What do we most want to know from a journal
article about agriculture?
  • AGENT: chemical agent applied
  • CV: cultivar (e.g. King Edward)
  • HLP: high level property (e.g. yield)
  • INF: influence (e.g. drought)
  • LAB: site of test (e.g. laboratory)
  • LLP: low level property (e.g. root mass)
  • LOC: location
  • PEST: pest or disease
  • SOIL: soil
  • SPEC: crop species (e.g. potato)
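
The slots above can be gathered into one frame per article. The following is a minimal Python sketch, assuming each slot holds at most one string; the class name AgricultureFrame, the method filled_slots and the example values are illustrative and do not come from the slides.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class AgricultureFrame:
    """One semantic frame; each field is one slot (names are illustrative)."""
    agent: Optional[str] = None  # AGENT: chemical agent applied
    cv: Optional[str] = None     # CV: cultivar
    hlp: Optional[str] = None    # HLP: high level property
    inf: Optional[str] = None    # INF: influence
    lab: Optional[str] = None    # LAB: site of test
    llp: Optional[str] = None    # LLP: low level property
    loc: Optional[str] = None    # LOC: location
    pest: Optional[str] = None   # PEST: pest or disease
    soil: Optional[str] = None   # SOIL: soil
    spec: Optional[str] = None   # SPEC: crop species

    def filled_slots(self) -> dict:
        """Return only the slots that extraction managed to fill."""
        return {k.upper(): v for k, v in asdict(self).items() if v is not None}


# Only a subset of the slots is filled, as the slides note is typical.
frame = AgricultureFrame(cv="King Edward", hlp="yield", inf="drought", spec="potato")
```

Leaving unfilled slots as None reflects the point that only a subset of the important facts may be identified and extracted.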

5
Automatic Abstracting
  • In the second case, rather than trying to
    determine specific facts, the goal of document
    summarisation is to extract a summary of an item
    maintaining the most important ideas while
    significantly reducing its size. For journal
    articles, this is called automatic abstracting.
    The abstract is a way for the user to determine
    the utility of an article without having to read
    the whole item.

6
Kupiec's heuristics
  • Sentence length feature: requires the sentence to
    be over five words in length.
  • Fixed phrase feature: looks for the existence of
    phrase cues, e.g. "in conclusion".
  • Paragraph feature: places emphasis on the first
    ten and the last five paragraphs in an item, and
    also on the location of the sentences within the
    paragraph.
  • Thematic word feature: uses word frequency.
  • Uppercase word feature: places emphasis on proper
    names and acronyms.
  • Kupiec found that location-based heuristics give
    better results than the frequency-based features.
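
Three of these features can be computed from a single sentence. The sketch below is one possible reading, not the method from the slides: the cue list FIXED_PHRASES, the function names and the idea of simply counting how many features fire are all assumptions made for illustration.

```python
import re

# Illustrative cue list (assumption); the slides only give "in conclusion".
FIXED_PHRASES = ("in conclusion", "in summary")


def sentence_features(sentence: str) -> dict:
    """Return three boolean features for one sentence."""
    words = re.findall(r"[A-Za-z']+", sentence)
    lowered = sentence.lower()
    return {
        # Sentence length feature: more than five words.
        "length": len(words) > 5,
        # Fixed phrase feature: contains one of the cue phrases.
        "fixed_phrase": any(p in lowered for p in FIXED_PHRASES),
        # Uppercase word feature: a capitalised word after the first word,
        # a rough stand-in for proper names and acronyms.
        "uppercase": any(w[0].isupper() for w in words[1:]),
    }


def feature_count(sentence: str) -> int:
    """Count how many features fire (a simple combination, assumed here)."""
    return sum(sentence_features(sentence).values())
```

The paragraph and thematic word features need the whole document rather than one sentence, so they are left out of this sketch.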

7
Paice's rules
  • Frequency keyword approach: first find a set of
    index terms for the document (manually,
    mid-frequency, tf-idf, words occurring in the
    title, etc.). Then choose the sentences which
    contain the most keywords.
  • Location: the first sentence in a paragraph is
    most central to the theme of a text. The last
    sentence is the next most central.
  • Cue method: cue words are not actually keywords,
    but their presence in a document shows that the
    sentence is (or is not) important. These may be
    bonus words, e.g. greatest, significant, or stigma
    words, e.g. hardly, impossible.
  • Indicator phrases, e.g. "The main aim of our
    paper is to describe ...", "Our investigation has
    shown that ...".
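
The frequency keyword approach and the cue method can be combined into one sentence score. This is a hedged sketch only: taking keywords from the title, giving every hit a weight of one, and the names tokenize, score_sentence and best_sentence are assumptions, and the word lists contain just the examples given above.

```python
import re

BONUS_WORDS = {"greatest", "significant"}  # examples from the slide
STIGMA_WORDS = {"hardly", "impossible"}    # examples from the slide


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_sentence(sentence, keywords):
    """Keyword hits plus bonus words minus stigma words (equal weights assumed)."""
    tokens = tokenize(sentence)
    keyword_hits = sum(1 for t in tokens if t in keywords)
    bonus = sum(1 for t in tokens if t in BONUS_WORDS)
    stigma = sum(1 for t in tokens if t in STIGMA_WORDS)
    return keyword_hits + bonus - stigma


def best_sentence(sentences, keywords):
    """Pick the highest-scoring sentence as a one-line extract."""
    return max(sentences, key=lambda s: score_sentence(s, keywords))


# Index terms taken from a made-up title, minus a function word.
keywords = set(tokenize("Drought and potato yield")) - {"and"}
sentences = [
    "Drought caused the greatest loss of potato yield.",
    "It is hardly surprising that weather matters.",
]
```

Here the first sentence scores 4 (three keywords and one bonus word) and the second scores -1 (one stigma word), so best_sentence returns the first.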

8
Hoey's method: cohesion in text
  • The most important sentences in a document are
    those which are related to the largest number of
    other sentences. Find how many concepts in each
    sentence are related to concepts in other
    sentences. Concepts may be related by:
  • Exact match, e.g. computer and computer
  • Grammatical variants, e.g. computer, computing
  • Synonyms, e.g. sedate, tranquilise, drug
  • Antonymy, e.g. cold, hot
  • General-specific, e.g. scientists, biologists
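
Counting related concepts between two sentences can be sketched as a set intersection. The table CONCEPTS below is a hypothetical hand-built lookup that maps each related word to a shared label; a real system would need a thesaurus and a stemmer, which the slides do not specify.

```python
import re

# Hypothetical lookup: words that count as "the same concept" share a label.
CONCEPTS = {
    "computer": "computer", "computing": "computer",              # variants
    "sedate": "sedate", "tranquilise": "sedate", "drug": "sedate",  # synonyms
    "cold": "temperature", "hot": "temperature",                  # antonyms
    "scientists": "scientist", "biologists": "scientist",         # general-specific
}


def concepts(sentence):
    """Return the set of concept labels mentioned in a sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {CONCEPTS[t] for t in tokens if t in CONCEPTS}


def link_count(s1, s2):
    """Number of concepts the two sentences have in common."""
    return len(concepts(s1) & concepts(s2))
```

For example, "Scientists use a computer." and "Biologists rely on computing." share two concepts under this table.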

9
Hoey (2)
  • Form a repetition net, with entries in the form
    s (a, b), such as 26 (6, 4), meaning sentence no.
    26 is bonded to 6 earlier sentences and 4 later
    sentences.
  • If a + b is high, the sentence is central to the
    topic.
  • If only b is high, the sentence is a topic
    opener.
  • If only a is high, the sentence is topic closing.
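
A repetition net can be built from a list of bonded sentence pairs. The sketch below assumes the bonds have already been found; the threshold high=3, the label "marginal" and the reading of "central" as both counts reaching the threshold are assumptions, since the slides do not say what counts as high.

```python
def repetition_net(n_sentences, bonds):
    """Map each sentence number to (a, b).

    bonds: iterable of (i, j) pairs of bonded sentence numbers, 1-based.
    a = number of earlier sentences bonded to, b = number of later ones.
    """
    net = {s: [0, 0] for s in range(1, n_sentences + 1)}
    for i, j in bonds:
        earlier, later = min(i, j), max(i, j)
        net[later][0] += 1    # the later sentence gains an earlier bond (a)
        net[earlier][1] += 1  # the earlier sentence gains a later bond (b)
    return {s: tuple(ab) for s, ab in net.items()}


def classify(a, b, high=3):
    """Label a sentence from its (a, b) entry; the threshold is an assumption."""
    if a >= high and b >= high:
        return "central"
    if b >= high:
        return "topic opener"
    if a >= high:
        return "topic closing"
    return "marginal"


net = repetition_net(4, [(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)])
```

In this toy net, sentence 1 is bonded only to later sentences and sentence 4 only to earlier ones, so they come out as topic opener and topic closing respectively.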

10
Hoey (3)
  • Cohesion in text is concerned with explicit
    references within a sentence which can only be
    understood by reference to material elsewhere in
    the text.
  • Anaphora come after their explicit mention in the
    text, e.g. Marie Curie was born in Warsaw. She
    devoted her life to the study of radioactivity.
  • Cataphora come before their explicit mention in
    the text, e.g. He was to become the best known
    physicist of his generation. His name was Albert
    Einstein.

11
Generating Canned Text
  • This paper studies the effect of AGENT on the HLP
    of SPEC
  • OR
  • This paper studies the effect of INF on the HLP
    of SPEC
  • when it is infested by PEST.
  • An experiment was undertaken
  • using cultivars CV
  • in/at LOC
  • where the soil was SOIL.
  • The HLP is/are measured by analysing the LLP.
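
Filling this template from a frame of slots might look like the sketch below. It assumes the frame is a plain dict keyed by slot name, always uses "at" and "is" rather than choosing between the alternatives, and skips any clause whose slot is empty; the function name canned_summary is illustrative.

```python
def canned_summary(frame):
    """Build canned text from a dict mapping slot names to extracted strings."""
    cause = frame.get("AGENT") or frame.get("INF")
    if not (cause and frame.get("HLP") and frame.get("SPEC")):
        return ""  # the opening sentence needs these three slots
    text = (f"This paper studies the effect of {cause} "
            f"on the {frame['HLP']} of {frame['SPEC']}")
    if frame.get("PEST"):
        text += f" when it is infested by {frame['PEST']}"
    text += "."

    # Optional clauses of the "An experiment was undertaken" sentence.
    details = ""
    if frame.get("CV"):
        details += f" using cultivars {frame['CV']}"
    if frame.get("LOC"):
        details += f" at {frame['LOC']}"
    if frame.get("SOIL"):
        details += f" where the soil was {frame['SOIL']}"
    if details:
        text += " An experiment was undertaken" + details + "."

    if frame.get("LLP"):
        text += f" The {frame['HLP']} is measured by analysing the {frame['LLP']}."
    return text


summary = canned_summary({"INF": "drought", "HLP": "yield", "SPEC": "potato",
                          "CV": "King Edward", "LLP": "root mass"})
```

Because some of the output wording is not copied from the article, text generated this way is an abstract rather than an extract.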

12
Extracts vs. Abstracts (Mani, p. 6)
  • An extract is a summary consisting entirely of
    material copied from the input
  • An abstract is a summary at least some of whose
    material is not present in the input.
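
The distinction can be tested mechanically. This is a rough sketch that treats "copied from the input" as exact substring matching of whole sentences, which is an assumption; the function name is_extract is illustrative.

```python
def is_extract(summary_sentences, source_text):
    """True if every summary sentence appears verbatim in the source."""
    return all(sentence in source_text for sentence in summary_sentences)


source = "IR selects documents. IE pulls out phrases."
```

A summary that fails this check contains material not present in the input and so counts as an abstract under the definition above.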