Some Linguistic Implications of the SHARES Project

Provided by: antoinet7
1
System of Hypermatrix Analysis, Retrieval,
Evaluation and Summarisation
Some Linguistic Implications of the SHARES Project
J. Banerjee, RDUES, University of Liverpool
2
Principle of Lexical Cohesion: Hypermatrix
Structure
  • Basic hypothesis: similar patterns of lexical
    repetition occur in texts on similar topics
  • Networks of lexical repetition are used to
    identify closely related sentences across texts,
    and thereby similar texts
  • The hypermatrix structure identifies links
    between repeated words and bonds between
    closely linked sentences
  • The number of bonded sentences between a pair
    of texts gives their match score. This is
    higher for texts on similar topics

3
Links and Bonds
Indonesia and Malaysia have taken their first
sips of the bitter medicine of economic
retrenchment, scaling back their growth plans,
halting expensive building projects and
announcing austerity measures.
The central bank voted again on Tuesday not to
raise interest rates, apparently in the belief
that economic growth is now moderating and that
any tightening would risk further destabilizing
Asia at a time of political unrest in Indonesia.
Links = 3, Bond = 1 if Link Threshold = 3
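The link count for this sentence pair can be reproduced with a minimal Python sketch. It assumes that a link is a word shared by both sentences after lowercasing and after dropping function words, and that a bond exists once the link count reaches the threshold. The `STOPWORDS` set and the helper names are illustrative assumptions; the slides do not say how SHARES itself tokenises text or filters words.

```python
import re

# Hypothetical stopword list (an assumption: the slides do not describe
# how function words are treated).
STOPWORDS = {"a", "and", "at", "in", "is", "of", "on", "that", "the", "to"}

S1 = ("Indonesia and Malaysia have taken their first sips of the bitter "
      "medicine of economic retrenchment, scaling back their growth plans, "
      "halting expensive building projects and announcing austerity measures.")
S2 = ("The central bank voted again on Tuesday not to raise interest rates, "
      "apparently in the belief that economic growth is now moderating and "
      "that any tightening would risk further destabilizing Asia at a time "
      "of political unrest in Indonesia.")

LINK_THRESHOLD = 3  # value taken from the slide


def content_words(sentence):
    """Lowercased alphabetic tokens of a sentence, minus stopwords."""
    return set(re.findall(r"[a-z]+", sentence.lower())) - STOPWORDS


def links(sentence_a, sentence_b):
    """Words repeated across the two sentences."""
    return content_words(sentence_a) & content_words(sentence_b)


shared = links(S1, S2)
bonded = len(shared) >= LINK_THRESHOLD
print(sorted(shared), bonded)  # ['economic', 'growth', 'indonesia'] True
```

With this stopword list the three linking words are "economic", "growth" and "Indonesia", which agrees with the slide's count of 3 links and 1 bond.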
4
Links and Bonds
Article A
S1 w1 w2 w3 w4 w1 w5 w6 w7 w8
S2 w9 w10 w1 w3 w4 w11 w12 w5 w13
S8 w9 w2 w14 w15 w3 w16 w5 w17 w18
Article B
S9 w19 w20 w3 w4 w5 w11 w12 w5 w13 w21
Links = 3, Bond = 1
6
Links and Bonds
Article A
S1 w1 w2 w3 w4 w1 w5 w6 w7 w8
S2 w9 w10 w1 w3 w4 w11 w12 w5 w13
S8 w9 w2 w14 w15 w3 w16 w5 w17 w18
Article B
S9 w19 w20 w3 w4 w5 w11 w12 w5 w13 w21
Links = 3, Bond = 1
7
Links and Bonds
Article A
S1 w1 w2 w3 w4 w1 w5 w6 w7 w8
S2 w9 w10 w1 w3 w4 w11 w12 w5 w13
S8 w9 w2 w14 w15 w3 w16 w5 w17 w18
Article B
S9 w19 w20 w3 w4 w5 w11 w12 w5 w13 w21
Links = 6, Bond = 1
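The schematic example above can be worked through in a short Python sketch that builds a small link matrix between the sentences of Article A and Article B and counts the bonds. It assumes that each distinct repeated word counts as one link, that a bond needs at least 3 links, and that S1, S2 and S8 belong to Article A and S9 to Article B. The function names are hypothetical and the sketch is not the SHARES implementation.

```python
from itertools import product

# Sentences from the schematic example, as space-separated word tokens.
ARTICLE_A = {
    "S1": "w1 w2 w3 w4 w1 w5 w6 w7 w8",
    "S2": "w9 w10 w1 w3 w4 w11 w12 w5 w13",
    "S8": "w9 w2 w14 w15 w3 w16 w5 w17 w18",
}
ARTICLE_B = {
    "S9": "w19 w20 w3 w4 w5 w11 w12 w5 w13 w21",
}


def link_count(sentence_a, sentence_b):
    """Number of distinct words the two sentences share (assumed link definition)."""
    return len(set(sentence_a.split()) & set(sentence_b.split()))


def link_matrix(text_a, text_b):
    """Link counts for every cross-text sentence pair."""
    return {
        (i, j): link_count(text_a[i], text_b[j])
        for i, j in product(text_a, text_b)
    }


def match_score(text_a, text_b, threshold=3):
    """Number of bonded sentence pairs, i.e. pairs with at least `threshold` links."""
    matrix = link_matrix(text_a, text_b)
    return sum(1 for n in matrix.values() if n >= threshold)


matrix = link_matrix(ARTICLE_A, ARTICLE_B)
print(matrix)  # {('S1', 'S9'): 3, ('S2', 'S9'): 6, ('S8', 'S9'): 2}
print(match_score(ARTICLE_A, ARTICLE_B))  # 2
```

Under these assumptions S1 and S9 share 3 words and S2 and S9 share 6, so both pairs bond, while S8 and S9 share only 2 and do not. The match score for the pair of articles is therefore 2.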
8
Visualisation Interface
9
TDT2 Corpus
  • Corpus designed for the US Topic Detection and
    Tracking (TDT) programme
  • Consists of US newspaper and newswire articles
    (NYT, APW), radio broadcasts (VOA) and
    television broadcasts (CNN, ABC)
  • 64,527 articles, 1,111,445 sentences, 20,232,752
    tokens
  • 100 specified topics
  • 8,040 articles with topics assigned (some
    articles have 2 or more topics; some articles
    are misclassified)

10
Mini Test Corpus
  • 33 articles, 11 topics with 3 articles per topic
  • 1,259 sentences, 27,948 tokens, 5,999 types
  • Topic Groups:
  • A: Asian Economic Crisis
  • B: Monica Lewinsky
  • C: McVeigh's Navy Dismissal Fight
  • D: Fossett's Balloon Ride
  • E: Pope Visits Cuba
  • F: 1998 Winter Olympics
  • G: Current Conflict with Iraq
  • H: Violence in Algeria
  • I: Quality of Life, NYC
  • J: Superbowl '98
  • K: China Air Crash

11
Weighting Issues: Short Sentences
1136 Gosh, the dreaded AFC streak is over.
1137 Denver is finally king and so is the AFC.
1138 Brett Favre, the Packers' quarterback, and his teammates tried desperately until the finish.
1139 The Packers had the ball at the Denver 31 with 32 seconds left and faced a fourth-and-6.
1140 Favre dropped back.
1141 The pocket pressure was intense.
1142 He looked for tight end Mark Chmura, but John Mobley was there to swat the ball away.
1143 Thus, for Denver, for the AFC, so was all of that misery.
1144 Swatted away.
1145 John Elway, the Denver quarterback, got his ring in his fourth Super Bowl try and the Broncos reign.
12
Weighting Issues: Long Sentences
236 In a case that has unified gay-rights groups
and advocates of cyberspace privacy, the petty
officer has filed suit in U.S. District Court in
Washington to save his Navy career, arguing that
his dismissal is the result of a violation both
of federal privacy laws and of the Defense
Department's "don't ask, don't tell" policy, which
was supposed to put an end to aggressive
campaigns to ferret out homosexuals in the
military. (72 tokens)
13
Weighting Issues: Frequent Words
- Article A (Asian Economic Crisis): 1 occurrence of "New York/ers"
  "It's been quite shocking to see the situation deteriorate to the extent that it has," said Leslie Richardson, the managing director of the Asian Equities Division for SocGen-Crosby Securities in New York.
- Article B (Quality of Life in NYC): 10 occurrences of "New York/ers", e.g.
  A contract dispute has strained the mayor's relations with the rank-and-file officers, who have balked at ticketing jaywalking, which is illegal but practiced by most New Yorkers.
  A Marist College poll this week placed the mayor first among registered voters in New York as a possible candidate in a Republican presidential primary; Gov. George Pataki came in fourth.
  He added that he would push for city workers to treat New Yorkers politely and for children in the public elementary schools to take civics classes, an idea welcomed Wednesday by Schools Chancellor Rudy Crew.
  These poses, political experts said, are aimed at that aspect of the city that inspires a mixture of fear and fascination among suburban sensibilities, like the character of the out-of-control New York cabdriver that David Letterman has popularized across the nation.
  In an address Wednesday announcing the second phase of his quality of life campaign for New York City.

14
Weighting: Theme and Rheme
15
Weighting: Theme and Rheme
16
Weighting Strategies
  • Bond counts are given weights based on factors
    such as:
  • Observed vs. expected frequencies of linking
    words (so that rare word links are weighted
    higher than common word links), e.g. expected
    frequency of "New York" in Article B = 1.42,
    observed frequency = 10 (high value indicates
    likely topic word)
  • Z scores of linking words, e.g. "New York" =
    0.84 (low value indicates uniform distribution,
    i.e. non-topic word)
  • Document length
  • Sentence length
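The observed vs. expected comparison can be illustrated with a minimal Python sketch. It assumes the expected frequency is the word's relative frequency in the reference corpus multiplied by the document length, which the slides do not state explicitly. The function names are hypothetical, and the Z score is left out because the slides do not define how it is computed.

```python
def expected_frequency(corpus_count, corpus_tokens, doc_tokens):
    """Occurrences expected in a document if the word were spread evenly
    over the reference corpus (assumed definition)."""
    return corpus_count / corpus_tokens * doc_tokens


def observed_expected_ratio(observed, expected):
    """Ratio above 1 means the word occurs more often than expected."""
    return observed / expected


# Figures quoted on the slide for "New York" in Article B.
ratio = observed_expected_ratio(observed=10, expected=1.42)
print(round(ratio, 2))  # 7.04
```

On the slide's figures "New York" occurs about seven times more often in Article B than expected, which is the sense in which a high value marks a likely topic word.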

17
Link Matrix
18
Weighted Link Matrix
19
Similarity Scores
  • Inter-document similarity score is generated
  • Bond counts weighted by relative word frequency
    in reference TDT corpus and by document
    length
  • Weighted bond counts aggregated over sentence
    pairs within any document pair
  • Weighted bond counts scaled to similarity scores
    such that Similarity Score = 0 if two documents
    have no bonds, and = 1 if two documents are as
    bonded with each other as they are with
    themselves
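One scaling that satisfies both end points is to divide the weighted bond count of a document pair by the geometric mean of the two documents' self-bond counts. The Python sketch below uses that normalisation purely as an assumption; the slides do not give the actual formula, and the function name and example values are hypothetical.

```python
import math


def similarity(w_ab, w_aa, w_bb):
    """Scale a weighted bond count using the documents' self-bond counts.

    w_ab: weighted bond count between documents A and B
    w_aa, w_bb: weighted bond counts of A and B with themselves
    Geometric-mean normalisation is an assumed choice, not the SHARES formula.
    """
    if w_ab == 0 or w_aa == 0 or w_bb == 0:
        return 0.0
    return w_ab / math.sqrt(w_aa * w_bb)


print(similarity(0.0, 12.0, 8.0))  # 0.0 (no bonds)
print(similarity(6.0, 6.0, 6.0))   # 1.0 (as bonded with each other as with themselves)
print(similarity(3.0, 12.0, 3.0))  # 0.5
```

Other normalisations, such as dividing by the larger or the mean of the two self-bond counts, would meet the same two end points.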

20
Similarity Score Matrix
21
Document Similarity Representation
  • 11 groups of 3 documents

22
Future Work
  • Thesaural input to improve recall
  • Investigation of lemmatisation
  • Large-scale testing
  • Proper noun identification/resolution