1
Topic Detection and Tracking (TDT)
  • CSCI 6403 PRESENTATION

Weizheng Gao, Xuhao Lai, Litao Ou
2
Outline
  • Introduction
  • Combining Semantic and Syntactic Document
    Classifiers to Improve First Story Detection
  • First Story Detection In TDT Is Hard
  • Topic Detection System in Broadcast News

3
What is TDT?
I. INTRODUCTION [1]
  • TDT refers to automatic techniques for finding
    topically related material in streams of data.

For example
4
To this
(Figure: blocks with the same color are stories about the same event in several media)
5
Research Applications Defined in the TDT
  • Story Segmentation
  • Topic Tracking
  • Topic Detection
  • First Story Detection
  • Link Detection

6
1. Story Segmentation
  • Detect changes between topically cohesive
    sections


7
2. Topic Tracking
  • Keep track of stories similar to a set of
    example stories

8
3. Topic Detection
  • Build clusters of stories that discuss the same
    topic

9
4. First Story Detection
  • Detect if a story is the first story of a
    topic

10
5. Link Detection
  • Detect whether or not two stories are topically
    linked

11
II. Combining Semantic and Syntactic Document Classifiers to Improve First Story Detection [2]
  • A Document Representation Strategy Using Lexical
    Chains
  • Detection Using Two Classifiers
  • Conclusions

12
A Document Representation Strategy Using Lexical
Chains
  • The cohesive structure of a document can be explored and represented by lexical chains.
  • For example, for the topic of airplanes, a typical chain might consist of the words {plane, airplane, pilot, cockpit, airhostess, wing, engine}.
  • Chain words address two linguistic problems associated with traditional syntactic representations: synonymy and polysemy.
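A chain like the one above can be grown greedily, adding each term to the most recently updated chain that contains a related word. This is a minimal sketch; the toy RELATED table stands in for a real thesaurus lookup (such as WordNet) and is purely an illustrative assumption.

```python
# Toy relatedness table standing in for a thesaurus lookup (assumption).
RELATED = {
    "plane": {"airplane", "pilot", "cockpit", "wing", "engine"},
    "airplane": {"plane", "pilot", "cockpit", "wing", "engine"},
    "pilot": {"plane", "airplane", "cockpit", "airhostess"},
    "cockpit": {"plane", "airplane", "pilot"},
    "airhostess": {"pilot", "plane"},
    "wing": {"plane", "airplane", "engine"},
    "engine": {"plane", "airplane", "wing"},
    "bank": {"loan", "deposit"},
    "loan": {"bank"},
}

def related(a, b):
    return b in RELATED.get(a, set()) or a in RELATED.get(b, set())

def build_chains(words):
    """Greedily add each term to the most recently updated chain
    containing a related word; otherwise start a new chain."""
    chains = []  # most recently updated chain is kept at the front
    for w in words:
        for chain in chains:
            if any(related(w, c) for c in chain):
                chain.append(w)
                chains.remove(chain)
                chains.insert(0, chain)  # mark as most recently updated
                break
        else:
            chains.insert(0, [w])
    return chains

print(build_chains(["plane", "pilot", "bank", "cockpit", "loan", "engine"]))
```

On this input the airplane terms and the banking terms separate into two chains, illustrating how chaining groups topically cohesive vocabulary.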

13
A Document Representation Strategy Using Lexical
Chains (cont)
  • Terms must be added to the most recently updated chain; this prompts the correct disambiguation of a word based on the context in which it was used.
  • Proper nouns are the second element of the combined document representation.
14
Detection Using Two Classifiers
  • 1. Convert the current document into a weighted
    chain word vector and a weighted proper noun
    vector.
  • 2. The first document becomes the first cluster.
  • 3. Subsequent incoming documents are compared
    with previously created clusters.
  • 4. Find the most similar cluster and check whether this document satisfies the similarity condition. If not, the document is declared to discuss a new event and forms the seed of a new cluster.
  • 5. This process continues until all documents
    have been classified.
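The five steps above can be sketched as a single pass over the document stream. The combined similarity (a weighted sum of chain-word and proper-noun cosine scores with weight alpha), the threshold value, and the use of a cluster's seed document as its representative are illustrative assumptions, not details from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def fsd(stream, threshold=0.3, alpha=0.5):
    """stream: iterable of (chain_vec, noun_vec) pairs, one per document.
    Returns a 'new event' flag per document plus the final clusters."""
    clusters = []  # each cluster represented by its seed document's vectors
    flags = []
    for chain_vec, noun_vec in stream:
        best = max(
            (alpha * cosine(chain_vec, c) + (1 - alpha) * cosine(noun_vec, n)
             for c, n in clusters),
            default=0.0,
        )
        if best < threshold:      # no existing cluster is similar enough:
            flags.append(True)    # declare a new event and seed a cluster
            clusters.append((chain_vec, noun_vec))
        else:
            flags.append(False)
    return flags, clusters
```

For example, two airplane stories followed by an election story would yield flags `[True, False, True]`: the second story joins the first cluster, the third seeds a new one.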

15
Conclusions
  • The results show that an increase in system
    effectiveness is achieved when lexical chain
    (semantic) representations are used in
    conjunction with proper noun (syntactic)
    representations.

  • A miss occurs when the system fails to detect a new event.
  • A false alarm occurs when the system indicates a story contains a new event when it does not.
16
III. First Story Detection In TDT Is Hard [3]
17
Two Tasks of TDT
  • Topic Tracking
  • In Tracking, the system is given a small number, Nt, of stories that are known to be on the same event-based news topic. The system then monitors the stream of subsequent news stories for ones that are on the same topic.

18
Two Tasks of TDT
  • First Story Detection (FSD)
  • FSD also monitors a stream of arriving news stories. However, the task is to mark each story as "first" or "not first", indicating whether or not it is the first one discussing a news topic.
  • The system provides a score for each story, where a high score indicates confidence that the story is first.

19
Tracking
  • The TDT tracking task is fundamentally similar to IR's information filtering task
  • Each begins with a representation of a topic and
    then monitors a stream of arriving documents,
    making decisions about documents as they arrive
    (without a deferral period).

20
Topic
  • Filtering is subject based
  • Tracking is event based
  • No user feedback after tracking begins

21
TDT-2 Corpus
  • Approximately 60,000 news stories from January
    through June of 1998
  • First four months of data for parameter tuning
    for tracking
  • Final two months for evaluation

22
Tracking System
  • A vector model for representing stories
  • A vector centroid for representing topics
  • Incoming stories are compared to the topic
    centroid
  • On-topic
  • Off-topic

23
FSD System
  • Same as the tracking system
  • Incoming stories are compared to every story that
    had appeared in the past
  • If the new story exceeds a threshold with any one
    of the stories, it is considered old, else it is
    considered new

24
TDT Assumption And Relevance
  • Stories are on a single topic
  • Multiple topics are not judged
  • IR: query
  • TDT: topic

25
Evaluation Measures (Effectiveness)
26
Evaluation Measures
  • Recall, Precision
  • Miss, False alarm
  • Richness
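These measures can be computed from a detection contingency table. The sketch below is a generic formulation (not the TDT scoring tools), treating new-event stories as the targets:

```python
def detection_measures(tp, fn, fp, tn):
    """Contingency counts: tp = correctly flagged new stories,
    fn = misses, fp = false alarms, tn = correctly ignored stories."""
    targets = tp + fn          # stories that really are new events
    nontargets = fp + tn       # stories that are not
    return {
        "recall":      tp / targets if targets else 0.0,
        "precision":   tp / (tp + fp) if tp + fp else 0.0,
        "miss":        fn / targets if targets else 0.0,
        "false_alarm": fp / nontargets if nontargets else 0.0,
        "richness":    targets / (targets + nontargets),
    }

m = detection_measures(tp=8, fn=2, fp=5, tn=95)
# recall 0.8, miss 0.2, false_alarm 0.05, richness 10/110
```

Note that recall and miss are complements, as are the two views of errors (miss rate over targets, false-alarm rate over non-targets) used throughout TDT evaluation.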

27
Bounds on FSD
  • One possible solution to FSD is to apply tracking
    technology
  • Intuitively, the system marks the first story of the corpus with a very high score; if the second story tracks, it is assigned a low FSD score.

28
Relating tracking and FSD
  • The probability we miss the first story for topic
    i

29
Relating tracking and FSD
  • The topic-weighted average value

30
Relating tracking and FSD (assuming topic error rates are independent)
  • The lower bound on the probability of an FSD false alarm for topic i

31
Relating tracking and FSD (assuming topic error rates are independent)
  • The upper bound

32
Expected FSD Performance (Figure 2)
33
Expected FSD Performance (Figure 3)
34
Difficulty of improving FSD (Figure 4)
35
Complexity Analysis
  • It is possible to reduce the TDT FSD problem to the TDT tracking problem
  • If the tracking problem is NP-complete, then FSD is hard
  • Knowing about such relationships may help avoid
    redundant research or unnecessary investigative
    dead-ends

36
Project Implementation with First Story Detection based on Tracking (Part III)
  • Corpus: A.txt, M.txt
  • First Story Detection based on Tracking
  • Evaluation measures: Recall, Precision, Miss, False alarm, Richness

37
IV. Topic Detection System in Broadcast News [4]
INTRODUCTION
  • Concerned with unsupervised grouping of news
    according to topic
  • Create story groupings through clustering
  • Involved with the stories on the same topic
  • Incremental k-means algorithm
  • Probabilistic Similarity metric
  • Selection and thresholding metrics
  • Experimental results

38
Incremental k-means algorithm
  • Find the closest cluster and decide whether to merge
  • Iterate through the stories and make changes
    during each iteration
  • Correct poor initial clusters
  • Computational requirement is less imposing
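The loop above can be sketched as follows. The cosine similarity, the threshold value, and the dict-based sparse vectors are illustrative choices, since the slide does not give the paper's actual distance metric or parameters:

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vecs):
    out = {}
    for v in vecs:
        for t, w in v.items():
            out[t] = out.get(t, 0.0) + w / len(vecs)
    return out

def incremental_kmeans(stories, threshold=0.3, n_iter=3):
    """Iterate over the stories several times, moving each story to its
    closest cluster (or seeding a new one when nothing is close enough);
    later passes can correct poor decisions made on the first pass."""
    clusters = []   # each cluster is a list of story indices
    assigned = {}   # story index -> cluster index
    for _ in range(n_iter):
        for i, s in enumerate(stories):
            if i in assigned:                  # detach from current cluster
                clusters[assigned[i]].remove(i)
            cents = [centroid([stories[j] for j in c]) if c else {}
                     for c in clusters]
            sims = [cosine(s, c) for c in cents]
            best = max(range(len(sims)), key=sims.__getitem__, default=None)
            if best is not None and sims[best] >= threshold:
                clusters[best].append(i)
                assigned[i] = best
            else:                              # nothing close: new cluster
                clusters.append([i])
                assigned[i] = len(clusters) - 1
        clusters = [c for c in clusters if c]  # drop emptied clusters
        assigned = {i: k for k, c in enumerate(clusters) for i in c}
    return clusters
```

Recomputing centroids as stories move is what keeps the computational requirement modest relative to all-pairs comparison.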


39
Probabilistic Similarity Metric
  • Utilize the BBN topic spotting metric
  • Calculate P(C|S) for topic detection
  • Derived from Bayes' rule
  • Assuming that the story words are conditionally independent, we get P(C|S) ∝ p(C) · Πn p(sn|C),
  • where p(sn|C) is the probability that a word in a story on the topic represented by cluster C would be sn,
  • and p(C) is the a priori probability that any new story will be relevant to cluster C.


40
Two-state model for a topic
  • The BBN topic spotting metric uses a two-state model
  • The model for p(sn|C):
  • One state is a distribution of the words in all
    of the stories in the group
  • The other is a distribution from the whole
    corpus
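The scoring on the two slides above can be sketched in log space: log P(C|S) up to a constant is log p(C) plus the summed log word probabilities, with p(sn|C) a two-state mixture of the cluster's word distribution and the whole-corpus distribution. The fixed interpolation weight `lam` and the small floor on unseen words are illustrative assumptions; the BBN estimation procedure is not given in the slides.

```python
import math

def log_score(story_words, cluster_unigram, corpus_unigram, prior, lam=0.5):
    """log P(C|S) up to a constant: log p(C) + sum_n log p(s_n|C),
    where p(s_n|C) mixes the cluster's word distribution with the
    whole-corpus distribution (two-state model). lam and the 1e-9
    floor for unseen words are illustrative choices."""
    total = math.log(prior)
    for w in story_words:
        p = (lam * cluster_unigram.get(w, 0.0)
             + (1 - lam) * corpus_unigram.get(w, 1e-9))
        total += math.log(p)
    return total
```

Topic detection then selects the cluster with the highest score; a story about planes scores higher against an airplane cluster than against an election cluster.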

41
Clustering Metric
  • 1. Selection Metric
  • Takes a story and outputs cluster scores
  • The BBN topic spotting metric finds the most topical cluster for a story
  • The selection metric D(S,C) could be chosen such that, where sm are the story words and p(sm|C) is computed according to the above model, D(S,C) is a justifiable metric for doing cluster selection

42
Experiment Evidence
  • A data set of clusters was extracted from each of TDT-1, 2, and 3
  • Each cluster contains stories on one topic
  • The misclassification rates for each data set are given in the table above
  • The probabilistic metric is a good candidate for the selection problem

43
  • 2. Thresholding Metric
  • Determine whether or not a story belongs in a
    cluster
  • Combine scores and features from the system
  • Score Normalization
  • Cosine distance metrics are naturally
    normalized
  • Length-normed Tspot
  • Simply divides the log probability by story
    length
  • Mean/sd-normed Tspot
  • Depends very little on the length of the story
  • The normalized score is also reasonable
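The two normalization schemes can be sketched directly. Dividing the log probability by story length is stated on the slide; the pool of recent scores used for mean/sd normalization is an assumption for illustration:

```python
import statistics

def length_normed(log_prob, story_len):
    """Length-normed Tspot: divide the log probability by story length."""
    return log_prob / story_len

def mean_sd_normed(score, recent_scores):
    """Mean/sd-normed Tspot: standardize a score against the mean and
    standard deviation of a pool of recent scores, so the result depends
    very little on story length (the choice of pool is an assumption)."""
    mu = statistics.mean(recent_scores)
    sd = statistics.stdev(recent_scores)
    return (score - mu) / sd
```

Both produce scores on a common scale, which is what makes them combinable with the naturally normalized cosine distance.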

44
Corpus and Evaluation
  • The Linguistic Data Consortium (LDC) released a corpus referred to as the TDT-2 corpus
  • It consists of 60,000 stories, subdivided into three two-month sets, drawn from both newswire and audio sources
  • An annotator determines which of the predefined topics are relevant to each story
  • A judgment is YES, BRIEF, or NO
  • The official evaluation metric is a weighted cost function
45
Weighted Cost function
  • Topic-weighted score
  • Counts each topic's contribution to the total cost equally
  • Story-weighted score
  • Counts each story's contribution to the total cost equally
  • CD is the final detection cost
  • PM and CM are the probability and cost of a miss
  • PFA and CFA are the probability and cost of a false alarm
  • PT is the a priori probability of a topic
  • Note: the official evaluation is based on the topic-weighted score
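The components listed above combine in the standard TDT detection-cost form, CD = CM·PM·PT + CFA·PFA·(1−PT). The default cost weights and topic prior below are typical TDT settings, shown as assumptions since the slide does not give the numbers:

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_topic=0.02):
    """CD = CM * PM * PT + CFA * PFA * (1 - PT).
    Default weights/prior are typical TDT settings (an assumption;
    the slide does not give the numbers)."""
    return c_miss * p_miss * p_topic + c_fa * p_fa * (1.0 - p_topic)

def topic_weighted(per_topic_miss, per_topic_fa, **kw):
    """Average the cost so each topic contributes equally."""
    costs = [detection_cost(m, f, **kw)
             for m, f in zip(per_topic_miss, per_topic_fa)]
    return sum(costs) / len(costs)
```

For example, with these defaults a miss rate of 0.2 and false-alarm rate of 0.05 give CD = 1·0.2·0.02 + 0.1·0.05·0.98 = 0.0089.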

46
Effect of Transcription Method
ASR (audio source) transcripts tend to have a high error rate of about 23%, but are relatively consistent. CCAP (closed-captioned data) transcripts have a smaller error rate, but are inconsistent. NWT (newswire stories) transcripts have the lowest error rate.
47
Different normalization schemes
  • Combination 1 depends on both cosine distance and length-normalized Tspot
  • Combination 2 depends on cosine distance and mean-normalized Tspot

48
Differences Between Data Sets
  • The Jan-Feb data has a few very broad topics and a few focused ones (this affects system performance)
  • The Mar-Apr data has roughly 1/8 the number of labeled stories of the Jan-Feb data set
  • The May-Jun set contains roughly 3 times the number of labeled stories of Mar-Apr
  • Table 3: results showing the correlation of CD with average topic size (using CCAP+NWT data)
49
Subset Experiments
  • The effect of multi-topic stories that contain
    non-annotated topics
  • The data set used is Mar-Apr CCAP+NWT
  • Create a data subset that contains only the
    stories that were annotated YES for one topic

50
Conclusion
  • Cluster news stories according to topic
  • Use K-means clustering algorithm to group the
    stories
  • The clustering algorithm requires two types of clustering metrics: selection and thresholding
  • The system uses the BBN metric for the selection metric
  • The system uses a hybrid of the BBN metric with a conventional cosine distance metric for thresholding

51
References
  • [1] http://www.nist.gov/speech/tests/tdt/
  • [2] Nicola Stokes and Joe Carthy, "Combining Semantic and Syntactic Document Classifiers to Improve First Story Detection", Department of Computer Science, University College Dublin
  • [3] James Allan, Victor Lavrenko, and Hubert Jin, "First Story Detection in TDT Is Hard", http://ciir.cs.umass.edu/pubfiles/ir-206.pdf
  • [4] Frederick Walls, Hubert Jin, Sreenivasa Sista, and Richard Schwartz, "Topic Detection in Broadcast News", http://www.nist.gov/speech/publications/darpa99/pdf/tdt320.pdf

52
  • Questions and Comments

Thank you!