1
Processing of large document collections
  • Part 3

2
Text summarization
  • Process of distilling the most important
    information from a source to produce an abridged
    version for a particular user or task

3
Text summarization
  • Many everyday uses
  • headlines (from around the world)
  • outlines (notes for students)
  • minutes (of a meeting)
  • reviews (of books, movies)
  • ...

4
Architecture of a text summarization system
  • Input
  • a single document or multiple documents
  • text, images, audio, video
  • database

5
Architecture of a text summarization system
  • Output
  • extract or abstract
  • compression rate
  • ratio of summary length to source length
  • connected text or fragmentary
  • generic or user-focused/domain-specific
  • indicative or informative

6
Architecture of a text summarization system
  • Three phases
  • analyzing the input text
  • transforming it into a summary representation
  • synthesizing an appropriate output form

7
Condensation operations
  • Selection of more salient or non-redundant
    information
  • aggregation of information (e.g. from different
    parts of the source, or of different linguistic
    descriptions)
  • generalization of specific information with more
    general, abstract information

8
The level of processing
  • Surface level
  • entity level
  • discourse level

9
Surface-level approaches
  • Tend to represent information in terms of shallow
    features
  • the features are then selectively combined
    together to yield a salience function used to
    extract information

10
Surface level
  • Shallow features
  • thematic features
  • presence of statistically salient terms, based on
    term frequency statistics
  • location
  • position in text, position in paragraph, section
    depth, particular sections
  • background
  • presence of terms from the title or headings in
    the text, or from the user's query

11
Surface level
  • Cue words and phrases
  • e.g. "in summary", "our investigation"
  • emphasizers like "important", "in particular"
  • domain-specific bonus (+) and stigma (-) terms

12
Entity-level approaches
  • Build an internal representation for text
  • modeling text entities and their relationships
  • tend to represent patterns of connectivity in the
    text to help determine what is salient

13
Relationships between entities
  • Similarity (e.g. vocabulary overlap)
  • proximity (distance between text units)
  • co-occurrence (words related based on their
    occurring in common contexts)
  • thesaural relationships among words (synonymy,
    hypernymy, part-of relations)
  • co-reference (of referring expressions such as
    noun phrases)
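As an illustration of the similarity relation, vocabulary
overlap between two text units can be computed with a Jaccard
measure; a minimal sketch (the measure and the whitespace
tokenization are assumptions, not fixed by the slides):

  def vocabulary_overlap(unit_a: str, unit_b: str) -> float:
      """Jaccard overlap between the word sets of two text units."""
      words_a = set(unit_a.lower().split())
      words_b = set(unit_b.lower().split())
      if not (words_a | words_b):
          return 0.0
      return len(words_a & words_b) / len(words_a | words_b)

  # Two sentences sharing part of their vocabulary:
  print(vocabulary_overlap("the pope was attacked",
                           "a priest attacked the pope"))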

14
Relationships between entities
  • Logical relationships (agreement, contradiction,
    entailment, consistency)
  • syntactic relations (based on parse trees)
  • meaning representation-based relations (e.g.
    based on predicate-argument relations)

15
Discourse-level approaches
  • Model the global structure of the text and its
    relation to communicative goals
  • structure can include
  • format of the document (e.g. hypertext markup)
  • threads of topics as they are revealed in the
    text
  • rhetorical structure of the text, such as
    argumentation or narrative structure

16
Classical approaches
  • Luhn 58
  • Edmundson 69

17
Luhn's method
  • Filter terms in the document using a stoplist
  • Terms are normalized by aggregating together
    orthographically similar terms
  • Frequencies of aggregated terms are calculated
    and non-frequent terms are removed

18
Luhn's method
  • Sentences are weighted using the resulting set
    of significant terms and a term density measure
  • each sentence is divided into segments bracketed
    by significant terms not more than 4
    non-significant terms apart

19
Luhn's method
  • each segment is scored by taking the square of
    the number of bracketed significant terms divided
    by the total number of bracketed terms
  • the score of the highest scoring segment is taken
    as the sentence score
  • the highest scoring sentences are chosen for the
    summary
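The segment scoring just described can be sketched directly; a
minimal sketch, assuming the sentence is already tokenized and
the significant terms have been identified (the gap of 4 comes
from the previous slide):

  def luhn_sentence_score(sentence_words, significant, max_gap=4):
      """Luhn's measure: for each segment bracketed by significant
      words at most `max_gap` non-significant words apart, compute
      (number of significant words)**2 / segment length; the best
      segment score is the sentence score."""
      best = 0.0
      i, n = 0, len(sentence_words)
      while i < n:
          if sentence_words[i] not in significant:
              i += 1
              continue
          # Grow a segment from the significant word at position i.
          end, j, gap = i, i + 1, 0
          while j < n and gap <= max_gap:
              if sentence_words[j] in significant:
                  end, gap = j, 0
              else:
                  gap += 1
              j += 1
          segment = sentence_words[i:end + 1]
          sig_count = sum(1 for w in segment if w in significant)
          best = max(best, sig_count ** 2 / len(segment))
          i = end + 1
      return best

  words = "the cat sat on the significant mat with significant style".split()
  print(luhn_sentence_score(words, {"significant", "mat"}))  # 2.25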

20
Edmundson's method
  • Extends earlier work to look at three features in
    addition to word frequencies
  • cue phrases (e.g. "significant", "impossible",
    "hardly")
  • title and heading words
  • location

21
Edmundson's method
  • Programs to weight sentences based on each of the
    four methods separately
  • programs were evaluated by comparison against
    manually created extracts
  • corpus-based methodology: training set and test
    set
  • in the training phase, weights were manually
    readjusted

22
Edmundson's method
  • Results
  • three additional features dominated word
    frequency measures
  • the combination of cue-title-location was the
    best, with location being the best individual
    feature
  • keywords alone performed worst
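Edmundson's sentence weight is a linear combination of the four
feature scores; a minimal sketch (the weights below are
illustrative placeholders echoing the reported ranking, not the
manually readjusted values from the paper):

  def edmundson_score(features, weights):
      """Linear combination of the four Edmundson features:
      cue, key (word frequency), title, and location."""
      return sum(weights[name] * features.get(name, 0.0)
                 for name in ("cue", "key", "title", "location"))

  # Hypothetical weights: location strong, keywords weak.
  weights = {"cue": 1.0, "key": 0.2, "title": 1.0, "location": 1.5}
  sentence = {"cue": 1, "key": 0.4, "title": 0, "location": 1}
  print(edmundson_score(sentence, weights))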

23
Fundamental issues
  • What are the most powerful, but also the most
    general, features to exploit for summarization?
  • How do we combine these features?
  • How can we evaluate how well we are doing?

24
Corpus-based approaches
  • In the classical methods, various features
    (thematic features, title, location, cue phrase)
    were used to determine the salience of
    information for summarization
  • an obvious issue: determining the relative
    contribution of different features to any given
    text summarization task

25
Corpus-based approaches
  • The contribution is dependent on the text genre,
    e.g. location:
  • in newspaper stories, the leading text often
    contains a summary
  • in TV news, a preview segment may contain a
    summary of the news to come
  • in scientific texts, the author-written abstract

26
Corpus-based approaches
  • The importance of different text features for any
    given summarization problem can be determined by
    counting the occurrences of such features in text
    corpora
  • in particular, analysis of human-generated
    summaries, along with their full-text sources,
    can be used to learn rules for summarization

27
Corpus-based approaches
  • One could use a corpus to model particular
    components, without using a completely trainable
    approach
  • e.g. a corpus can be used to compute term weights
    (tf.idf)
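For instance, tf.idf term weights can be computed from corpus
counts; a minimal sketch of the standard formula (the example
numbers are hypothetical):

  import math

  def tfidf(term_count, doc_length, num_docs, docs_with_term):
      """Standard tf.idf weight: term frequency in the document
      times inverse document frequency in the corpus."""
      tf = term_count / doc_length
      idf = math.log(num_docs / docs_with_term)
      return tf * idf

  # A term occurring 5 times in a 100-word document and appearing
  # in 10 of 1000 corpus documents:
  print(tfidf(5, 100, 1000, 10))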

28
Corpus-based approaches
  • Challenges
  • creating a suitable text corpus, designing an
    annotation scheme
  • ensuring the suitable set of summaries is
    available
  • summaries may already be available, as for
    scientific papers
  • if not, they must be written by the author, a
    professional abstractor, or a judge

29
KPC method
  • Kupiec, Pedersen, Chen (1995): A Trainable
    Document Summarizer
  • a learning method using a corpus of abstracts
    written by professional human abstractors
    (Engineering Information Co.)
  • a naïve Bayes classifier is used

30
KPC method features
  • Sentence-length cut-off feature
  • given a threshold (e.g. 5 words), the feature is
    true for all sentences longer than the threshold,
    and false otherwise
  • fixed-phrase feature
  • this feature is true for sentences that contain
    any of 26 indicator phrases (e.g. "this letter",
    "In conclusion"), or that follow a section heading
    containing specific keywords (e.g. "results",
    "conclusions")

31
KPC method features
  • Paragraph feature
  • sentences in the first 10 paragraphs and the last
    5 paragraphs in a document get a higher value
  • within paragraphs, paragraph-initial,
    paragraph-final, and paragraph-medial positions
    are distinguished

32
KPC method features
  • Thematic word feature
  • a small number of thematic words (the most
    frequent content words) are selected
  • each sentence is scored as a function of
    frequency of the thematic words
  • highest scoring sentences are selected
  • a binary feature: true for a sentence if the
    sentence is present in the set of highest-scoring
    sentences

33
KPC method features
  • Uppercase word feature
  • proper names and explanatory text for acronyms
    are usually important
  • feature is computed like the thematic word
    feature
  • an uppercase thematic word
  • is not sentence-initial, begins with a capital
    letter, and occurs several times
  • first occurrence is scored twice as much as later
    occurrences

34
KPC method classifier
  • For each sentence s, we compute the probability
    that it will be included in a summary S, given its
    k features F_j, j = 1, …, k
  • the probability can be expressed using Bayes'
    rule:
    P(s ∈ S | F_1, …, F_k) =
      P(F_1, …, F_k | s ∈ S) · P(s ∈ S) / P(F_1, …, F_k)

35
KPC method classifier
  • Assuming statistical independence of the
    features:
    P(s ∈ S | F_1, …, F_k) =
      P(s ∈ S) · ∏_j P(F_j | s ∈ S) / ∏_j P(F_j)
  • P(s ∈ S) is a constant, and P(F_j | s ∈ S) and
    P(F_j) can be estimated directly from the training
    set by counting occurrences
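In log space this score is a sum of per-feature log-probability
ratios. A minimal sketch of the scoring step, assuming the
probability tables have already been counted from a training
corpus (the two binary features and all numbers below are
hypothetical):

  import math

  def summary_score(feature_values, p_f_given_s, p_f):
      """Naive Bayes sentence score: log of
      prod_j P(F_j | s in S) / P(F_j), i.e. the posterior up to
      the constant factor P(s in S)."""
      score = 0.0
      for j, value in enumerate(feature_values):
          score += math.log(p_f_given_s[j][value])
          score -= math.log(p_f[j][value])
      return score

  # Hypothetical estimates for two binary features (e.g.
  # sentence-length cut-off and fixed-phrase):
  p_f_given_s = [{True: 0.9, False: 0.1}, {True: 0.3, False: 0.7}]
  p_f = [{True: 0.7, False: 0.3}, {True: 0.1, False: 0.9}]
  print(summary_score([True, True], p_f_given_s, p_f))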

36
KPC method corpus
  • Corpus is acquired from Engineering Information
    Co, which provides abstracts of technical
    articles to online information services
  • articles do not have author-written abstracts
  • abstracts were created by professional abstractors

37
KPC method corpus
  • 188 document/summary pairs sampled from 21
    publications in the scientific/technical domain
  • summaries are mainly indicative, average length
    is 3 sentences
  • average number of sentences in the original
    documents is 86
  • author, address, and bibliography were removed

38
KPC method sentence matching
  • The abstracts from the human abstractors are not
    extracts but inspired by the original sentences
  • the automatic summarization task here is to
    extract sentences that the human abstractor might
    have chosen to prepare the summary text (with
    minor modifications)

39
KPC method sentence matching
  • For training, a correspondence between the manual
    summary sentences and sentences in the original
    document needs to be obtained
  • matching can be done in several ways

40
KPC method sentence matching
  • matching can be done in several ways
  • a direct sentence match
  • the same sentence is found in both
  • a direct join
  • 2 or more original sentences were used to form a
    summary sentence
  • summary sentence can be unmatchable
  • summary sentence (single or joined) can be
    incomplete

41
KPC method sentence matching
  • Matching was done in two passes
  • first, the best one-to-one sentence matches were
    found automatically (79%)
  • second, these matches were used as a starting
    point for the manual assignment of
    correspondences

42
KPC method evaluation
  • Cross-validation strategy for evaluation
  • documents from a given journal were selected for
    testing one at a time; all other document/summary
    pairs were used for training
  • unmatchable and incomplete sentences were
    excluded
  • total of 498 unique sentences

43
KPC method evaluation
  • Two ways of evaluation
  • the fraction of manual summary sentences that
    were faithfully reproduced by the summarizer
    program
  • the summarizer produced the same number of
    sentences as were in the corresponding manual
    summary
  • → 35%
  • 83% is the highest possible value, since
    unmatchable and incomplete sentences were excluded

44
KPC method evaluation
  • The fraction of the matchable sentences that
    were correctly identified by the summarizer
  • → 42%
  • the effect of different features was also studied
  • best combination (44%): paragraph, fixed-phrase,
    sentence-length
  • baseline: selecting sentences from the beginning
    of the document (result: 24%)
  • if 25% of the original sentences are selected: 84%

45
Discourse-based approaches
  • Discourse structure appears to play an important
    role in the strategies used by human abstractors
    and in the structure of their abstracts
  • an abstract is not just a collection of
    sentences, but it has an internal structure
  • → an abstract should be coherent and should
    represent some of the argumentation used in the
    source

46
Discourse models
  • Cohesion
  • relations between words or referring expressions,
    which determine how tightly connected the text is
  • anaphora, ellipsis, synonymy, hypernymy ("dog"
    is a kind of "animal")
  • coherence
  • overall structure of a multi-sentence text in
    terms of macro-level relations between sentences
    (e.g. "although" → contrast)

47
Boguraev, Kennedy (BG)
  • Goal: identify those phrasal units across the
    entire span of the document that best function as
    representative highlights of the document's
    content
  • these phrasal units are called topic stamps
  • a set of topic stamps is called a capsule
    overview

48
BG
  • A capsule overview
  • not a set/sequence of sentences
  • a semi-formal (normalised) representation of the
    document, derived after a process of data
    reduction over the original text
  • not always very readable, but still represents
    the flow of the narrative
  • can be combined with surrounding information to
    produce more coherent presentation

49
BG
  • Primary consideration: methods should apply to
    any document type and source (domain
    independence)
  • also efficient and scalable technology
  • shallow syntactic analysis, no comprehensive
    parsing engine needed

50
BG
  • Based on the findings on technical terms
  • technical terms have linguistic properties that
    can be used to find terms automatically in
    different domains quite reliably
  • technical terms seem to be topical
  • task of content characterization: identifying
    phrasal units that have
  • lexico-syntactic properties similar to technical
    terms
  • discourse properties that signify their status as
    most prominent

51
BG terms as content indicators
  • Problems
  • undergeneration
  • overgeneration
  • differentiation

52
Undergeneration
  • A set of phrases should contain an exhaustive
    description of all the entities that are
    discussed in the text
  • the set of technical terms has to be extended to
    include also expressions with pronouns etc.

53
Overgeneration
  • the set of technical terms alone can already be
    large
  • extensions make the information overload even
    worse
  • solution: phrases that refer to the same
    participant in the discourse are combined via
    referential links

54
Differentiation
  • The same list of terms may be used to describe
    two documents, even if they, e.g., focus on
    different subtopics
  • it is necessary to differentiate term sets not
    only according to their membership, but also
    according to the relative representativeness of
    the terms they contain

55
Term sets and coreference classes
  • Phrases are extracted using a phrasal grammar
    (e.g. a noun with modifiers)
  • also expressions with pronouns and incomplete
    expressions are extracted
  • using a (Lingsoft) tagger that provides
    information about the part of speech, number,
    gender, and grammatical function of tokens in a
    text
  • solves the undergeneration problem
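A toy version of this phrase-extraction step over POS-tagged
tokens; the tag set and the single adjective-noun rule are
simplifying assumptions (BG rely on the much richer output of
the Lingsoft tagger):

  def extract_noun_phrases(tagged_tokens):
      """Very small phrasal grammar: a maximal run of
      adjectives/nouns ending in a noun is a candidate phrase."""
      phrases, current = [], []
      for word, pos in tagged_tokens:
          if pos in ("ADJ", "NOUN"):
              current.append((word, pos))
          else:
              if current and current[-1][1] == "NOUN":
                  phrases.append(" ".join(w for w, _ in current))
              current = []
      if current and current[-1][1] == "NOUN":
          phrases.append(" ".join(w for w, _ in current))
      return phrases

  tagged = [("a", "DET"), ("spanish", "ADJ"), ("priest", "NOUN"),
            ("was", "VERB"), ("charged", "VERB")]
  print(extract_noun_phrases(tagged))  # ['spanish priest']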

56
Term sets and coreference classes
  • The phrase set has to be reduced to solve the
    problem of overgeneration
  • → a smaller set of expressions that uniquely
    identify the objects referred to in the text
  • application of anaphora resolution
  • e.g. to which noun does a pronoun such as "he"
    refer?

57
Resolving coreferences
  • Procedure
  • moving through the text sentence by sentence and
    analysing the nominal expressions in each
    sentence from left to right
  • either an expression is identified as a new
    participant in the discourse, or it is taken to
    refer to a previously mentioned referent

58
Resolving coreferences
  • Coreference is determined by a three-step
    procedure
  • a set of candidates is collected: all nominals
    within a local segment of discourse
  • some candidates are eliminated due to
    morphological mismatches or syntactic
    restrictions
  • remaining candidates are ranked according to
    their relative salience in the discourse

59
Salience factors
  • sent(term) = 100 iff term is in the current
    sentence
  • cntx(term) = 50 iff term is in the current
    discourse segment
  • subj(term) = 80 iff term is a subject
  • acc(term) = 50 iff term is a direct object
  • dat(term) = 40 iff term is an indirect object
  • ...

60
Local salience of a candidate
  • The local salience of a candidate is the sum of
    the values of the salience factors
  • the most salient candidate is selected as the
    antecedent for the anaphor
  • if a coreference link cannot be established to
    some other expression, the nominal is taken to
    introduce a new referent
  • → coreference classes
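A minimal sketch of the ranking step, using the (abbreviated)
factor values from the previous slide; the property names and
the example candidates are illustrative:

  SALIENCE_FACTORS = {
      "in_current_sentence": 100,
      "in_current_discourse_segment": 50,
      "subject": 80,
      "direct_object": 50,
      "indirect_object": 40,
  }

  def local_salience(properties):
      """Sum the values of the salience factors that hold."""
      return sum(SALIENCE_FACTORS[p] for p in properties)

  def resolve(candidates):
      """Pick the most salient candidate as the antecedent."""
      return max(candidates,
                 key=lambda c: local_salience(c["properties"]))

  candidates = [
      {"term": "priest",
       "properties": ["in_current_sentence", "subject"]},
      {"term": "Pope",
       "properties": ["in_current_sentence", "direct_object"]},
  ]
  print(resolve(candidates)["term"])  # priest (180 vs. 150)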

61
Topic stamps
  • In order to further reduce the referent set, some
    additional structure has to be imposed
  • the term set is ranked according to the salience
    of its members
  • relative prominence or importance in the
    discourse of the entities to which they refer
  • objects in the centre of discussion have a high
    degree of salience

62
Saliency
  • Measured like local salience in coreference
    resolution, but tries to measure the importance
    of unique referents in the discourse

63
Priest is charged with Pope attack
A Spanish priest was charged here today with
attempting to murder the Pope. Juan Fernandez
Krohn, aged 32, was arrested after a man armed
with a bayonet approached the Pope while he was
saying prayers at Fatima on Wednesday
night. According to the police, Fernandez told
the investigators today that he trained for the
past six months for the assault. He was alleged
to have claimed the Pope looked furious on
hearing the priest's criticism of his handling of
the church's affairs. If found guilty, the
Spaniard faces a prison sentence of 15-20 years.
64
Saliency
  • "priest" is the primary element
  • eight references to the same actor in the body of
    the story
  • these references occur in important syntactic
    positions: 5 are subjects of main clauses, 2 are
    subjects of embedded clauses, 1 is a possessive
  • "Pope attack" is also important
  • "Pope" occurs 5 times, but in less important
    positions (2 are direct objects)

65
Discourse segments
  • If the intention is to use very concise
    descriptions of one or two salient phrases, i.e.
    topic stamps, longer texts have to be broken down
    into smaller segments
  • topically coherent, contiguous segments can be
    found by using a lexical similarity measure
  • assumption: the distribution of words used changes
    when the topic changes
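A minimal sketch of such segmentation in the spirit of
TextTiling, placing a boundary where the vocabulary overlap
between adjacent blocks of sentences drops below a threshold
(block size and threshold are assumptions, not values from the
slides):

  def segment_boundaries(sentences, block_size=3, threshold=0.1):
      """Mark a segment boundary before sentence i when the word
      overlap between the blocks around i falls below threshold."""
      boundaries = []
      for i in range(block_size, len(sentences) - block_size + 1):
          before = set(w for s in sentences[i - block_size:i]
                       for w in s.lower().split())
          after = set(w for s in sentences[i:i + block_size]
                      for w in s.lower().split())
          overlap = len(before & after) / max(len(before | after), 1)
          if overlap < threshold:
              boundaries.append(i)
      return boundaries

  sents = ["The priest was charged.", "He trained for months.",
           "The trial starts soon.", "Utah won again.",
           "Malone scored 39 points.", "The streak continues."]
  print(segment_boundaries(sents, block_size=3, threshold=0.2))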

66
BG Summarization process
  • Linguistic analysis
  • discourse segmentation
  • extended phrase analysis
  • anaphora resolution
  • calculation of discourse salience
  • topic stamp identification
  • capsule overview

67
Knowledge-rich approaches
  • Structured information can be used as the
    starting point for summarization
  • structured information, e.g. data and knowledge
    bases, may have been produced by processing input
    text
  • the summarizer does not have to address the
    linguistic complexities and variability of the
    input; on the other hand, the structure of the
    input text is not available

68
Knowledge-rich approaches
  • There is a need for measures of salience and
    relevance that are dependent on the knowledge
    source
  • addressing coherence, cohesion, and fluency
    becomes the entire responsibility of the generator

69
STREAK, PLANDOC
  • McKeown, Robin, Kukich (1995): Generating concise
    natural language summaries
  • goal: folding information from multiple facts
    into a single sentence using concise linguistic
    constructions

70
STREAK
  • Produces summaries of basketball games
  • first creates a draft of essential facts
  • then uses revision rules constrained by the draft
    wording to add in additional facts as the text
    allows

71
STREAK
  • Input
  • a set of box scores for a basketball game
  • historical information (from a database)
  • task
  • summarize the highlights of the game,
    underscoring their significance in the light of
    previous games
  • output
  • a short summary of a few sentences

72
STREAK
  • The box score input is represented as a
    conceptual network that expresses relations
    between what were the columns and rows of the
    table
  • essential facts: the game result, its location,
    date, and at least one final game statistic (the
    most remarkable statistic of a winning-team
    player)

73
STREAK
  • Essential facts can be obtained directly from the
    box-score
  • in addition, other potential facts
  • other notable game statistics of individual
    players - from box-score
  • game result streaks ("Utah recorded its fourth
    straight win") - historical
  • extremum performances such as maximums or
    minimums - historical

74
STREAK
  • Essential facts are always included
  • potential facts are included if there is space
  • the decision on which potential facts to include
    can be based on whether they can be combined with
    the essential information in cohesive and
    stylistically successful ways

75
STREAK
  • Given the facts:
  • "Karl Malone scored 39 points."
  • "Karl Malone's 39-point performance is equal to
    his season high."
  • a single sentence is produced:
  • "Karl Malone tied his season high with 39 points."

76
PLANDOC
  • Produces summaries of telephone network planning
    activity
  • uses discourse planning, looking ahead in its
    text plan to group together facts which can be
    expressed concisely using conjunction and
    deleting repetitions

77
PLANDOC
  • The system must produce a report documenting how
    an engineer, using a sophisticated software
    planning system, investigated what new technology
    is needed in a telephone route to meet demand

78
PLANDOC
  • Input
  • a trace of user interaction with the planning
    system software PLAN
  • output
  • a 1-2 page report, including a paragraph summary
    of PLAN's solution, a summary of refinements that
    an engineer made to the system's solution, and a
    closing paragraph summarizing the engineer's
    final proposal

79
Summary generation
  • Summaries must convey maximal information in a
    minimal amount of space
  • requires the use of complex sentence structures
  • multiple modifiers of a noun or a verb
  • conjunction (and)
  • ellipsis (deletion of repetitions)
  • selection of words that convey multiple aspects
    of the information