Title: Generating topic signatures from a corpus of e-learning content.
1. Generating topic signatures from a corpus of e-learning content
- Marios N. Stamoulos
- Chris Bowerman
- Michael P. Oakes
- University of Sunderland
2. Outline
- Topic signatures: definition
- Statistical models used in our work
- Building topic signatures automatically using e-learning material
- Evaluation of the topic signature building algorithm
- Further work
3. Topic signatures: definition
- Topic signatures are a set of topic words and other words semantically related to them that together uniquely identify the topic (Chin-Yew Lin, 1997).
- A formal definition of a topic signature is:
- $TS = \{\text{topic}, \text{signature}\} = \{\text{topic}, \langle (w_1, s_1), \ldots, (w_i, s_i) \rangle\}$
- Here the topic is the target concept under investigation, and the signature is a vector of related terms $w$ that are highly correlated with the topic, each paired with its strength $s$.
- The strength of each associated word can be assigned automatically using statistical models.
- The number of terms contained in the topic signature is set empirically.
4. Topic signatures: example
- "You shall know a word by the company it keeps" (Firth, 1957).
- Network: a system of interconnected electronic components or circuits (WordNet, sense 5)
- network: (address, 230.39), (ip, 250.88), (protocol, 142.13), (layer, 214.99), (ethernet, 202.64)
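As a minimal sketch of how such a signature could be held in code, the pairing of a topic with weighted terms maps naturally onto a small record type. The class name `TopicSignature` and the helper `top` are illustrative assumptions, not part of the original work; the values are the ones listed in the example.

```python
from dataclasses import dataclass, field


@dataclass
class TopicSignature:
    """A topic plus its weighted signature terms (hypothetical names)."""
    topic: str
    signature: list[tuple[str, float]] = field(default_factory=list)

    def top(self, n: int) -> list[tuple[str, float]]:
        """Return the n strongest signature terms, strongest first."""
        return sorted(self.signature, key=lambda pair: pair[1], reverse=True)[:n]


# Values copied from the "network" example.
network = TopicSignature(
    topic="network",
    signature=[
        ("address", 230.39),
        ("ip", 250.88),
        ("protocol", 142.13),
        ("layer", 214.99),
        ("ethernet", 202.64),
    ],
)

print(network.top(3))  # [('ip', 250.88), ('address', 230.39), ('layer', 214.99)]
```

Keeping the signature as a list of pairs mirrors the formal definition; sorting on demand makes it easy to apply an empirically chosen cut-off on the number of terms.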
5. Topic signatures: past applications
- Topic summarisation
- Enriching WordNet and ontology learning
- Word sense disambiguation
- Question answering, e.g. personal profiles ("Who is X?")
6. Topic signatures: why do we need them? (Biryukov et al., 2005)
- Capture information on a level deeper than the word
- Extend topic identification to the concept level while not relying on knowledge bases or complex semantic parsers
- Assume semantically related terms tend to co-occur in the same context
7. Topic signature acquisition: our method
- Generate a corpus of CISCO training materials in computer networking.
- The full corpus generated is treated as a learning object.
- Identify keywords (the most important words describing the topic of each text) using statistical methods (TF-IDF, DFR, log likelihood).
- Get signature terms (co-occurring words) for each of the keywords.
8. Statistical measures for topic extraction: TF-IDF
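The following is a minimal sketch of keyword scoring with TF-IDF, assuming the common `tf * log(N / df)` formulation; the exact weighting variant used in this work is not stated. The function names `tf_idf` and `top_keywords` and the toy corpus are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Score every term of every tokenised document with tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        scores.append(
            {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in term_freq.items()}
        )
    return scores


def top_keywords(scores: dict[str, float], n: int) -> list[str]:
    """Return the n highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Toy corpus, invented for illustration only.
docs = [
    "the router forwards the packet".split(),
    "the switch forwards the frame".split(),
    "the router uses the routing table".split(),
]
weights = tf_idf(docs)
print(top_keywords(weights[0], 2))  # ['packet', 'forwards']
```

A term that occurs in every document ("the" here) receives a weight of zero, which is why the measure favours words that are frequent in one text but rare across the corpus.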
9. Statistical measures for topic extraction: DFR
- Divergence from randomness (DFR) model
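The specific DFR variant is not stated here, so what follows is only the general shape of the framework as introduced by Amati and van Rijsbergen, given as an assumption rather than the exact formula used in this work. The symbols $\mathrm{Prob}_1$, $\mathrm{Prob}_2$, $tfn$, $c$, $l$ and $avg\_l$ are not taken from the slides.

```latex
w(t, d) = \mathrm{Inf}_1 \cdot \mathrm{Inf}_2
        = \bigl(-\log_2 \mathrm{Prob}_1(tfn)\bigr) \cdot \bigl(1 - \mathrm{Prob}_2(tfn)\bigr),
\quad \text{with} \quad
tfn = tf \cdot \log_2\!\left(1 + c \cdot \frac{avg\_l}{l}\right)
```

In this reading, $\mathrm{Prob}_1$ is the probability of seeing the observed term frequency under a model of random term distribution, $\mathrm{Prob}_2$ estimates the information gain of one further occurrence, and $tfn$ is the term frequency normalised by the document length $l$ relative to the average length $avg\_l$.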
10. Statistical measures: log likelihood (G-test) I
- A contingency table was used to extract the terms that carry topic information about each document in the learning object.
11. Statistical measures for topic extraction: log likelihood (G-test) II
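As a hedged sketch of the G-test on a 2x2 contingency table, the function below uses the textbook form G = 2 * sum(O * ln(O / E)) over all four cells. The table layout (term versus other terms, target document versus the rest of the learning object) and the name `g_test_2x2` are assumptions made for illustration.

```python
import math


def g_test_2x2(a: float, b: float, c: float, d: float) -> float:
    """Log-likelihood ratio G = 2 * sum(O * ln(O / E)) over a 2x2 table.

    a: term in target text        b: term elsewhere
    c: other terms in target      d: other terms elsewhere
    """
    total = a + b + c + d
    observed = [a, b, c, d]
    # Expected count of each cell = row total * column total / grand total.
    expected = [
        (a + b) * (a + c) / total,
        (a + b) * (b + d) / total,
        (c + d) * (a + c) / total,
        (c + d) * (b + d) / total,
    ]
    # A cell with O = 0 contributes nothing (limit of x * ln x as x -> 0).
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Invented counts: a term seen 30 times in a 1000-token document
# and 20 times in the remaining 9000 tokens.
score = g_test_2x2(30, 20, 970, 8980)
print(score > 3.84)  # True: above the 5% critical value for 1 degree of freedom
```

When the term is distributed exactly in proportion to the text sizes, every observed count equals its expected count and the statistic is zero; large values mark terms that are characteristic of the target document.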
12. Finding signature words using TF-IDF
- Create a sub-corpus of the elite set, i.e. those documents containing the topic keyword(s).
- Assign weights to each term in the elite set using the TF-IDF measure. This identifies the terms that co-occur with the topic term.
- Variant 1 (worked best): take the term frequency from the sub-corpus and compute IDF over the sub-corpus.
- Variant 2 (worked less well): take the total term frequency in the learning object and compute IDF over the whole corpus.
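The elite-set procedure can be sketched as follows for the sub-corpus variant, assuming the plain `tf * log(N / df)` weighting; the function name `signature_by_tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def signature_by_tf_idf(documents, topic, n_terms):
    """Weight terms co-occurring with `topic` using sub-corpus tf and sub-corpus IDF."""
    # Elite set: the documents that contain the topic keyword.
    elite = [doc for doc in documents if topic in doc]
    if not elite:
        return []
    term_freq = Counter(term for doc in elite for term in doc)
    doc_freq = Counter(term for doc in elite for term in set(doc))
    weights = {
        term: tf * math.log(len(elite) / doc_freq[term])
        for term, tf in term_freq.items()
        if term != topic  # the topic itself is not part of its own signature
    }
    # Strongest terms first; ties are broken alphabetically.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n_terms]


# Toy tokenised documents, invented for illustration only.
docs = [
    ["network", "ip", "address", "ip"],
    ["network", "cable", "ethernet"],
    ["network", "ip", "layer"],
    ["printer", "paper"],
]
print([term for term, _ in signature_by_tf_idf(docs, "network", 2)])  # ['ip', 'address']
```

One consequence of computing IDF over the elite set is that any term present in every elite document receives a weight of zero, so only terms that discriminate within the sub-corpus are ranked.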
13. Finding signature words using log likelihood (G-test) I
- Create an elite set of documents that contain the topic term.
- Create a contingency table for each term in the elite set.
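14. Finding signature words using log likelihood (G-test) II
- The observed frequencies in the formula are:
- $O_1 = f_{t,d}$ (frequency of the term in the elite set)
- $O_2 = f_{t,\bar{d}}$ (frequency of the term in the rest of the learning object)
- The expected frequencies are calculated using:
- $E_1 = (f_{t,d} + f_{t,\bar{d}})(f_{t,d} + f_{\bar{t},d}) / (f_{t,d} + f_{t,\bar{d}} + f_{\bar{t},d} + f_{\bar{t},\bar{d}})$
- $E_2 = (f_{t,d} + f_{t,\bar{d}})(f_{t,\bar{d}} + f_{\bar{t},\bar{d}}) / (f_{t,d} + f_{t,\bar{d}} + f_{\bar{t},d} + f_{\bar{t},\bar{d}})$
- Here $\bar{t}$ stands for all terms other than $t$, and $\bar{d}$ for the part of the learning object outside the elite set.
- The log likelihood ratio is then calculated from these observed and expected frequencies.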
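A minimal sketch of this scoring step is given below. It assumes the ratio is summed over the two observed cells, LL = 2 * (O1 * ln(O1 / E1) + O2 * ln(O2 / E2)), which is a common choice for corpus comparison but is not confirmed by the slides. The function name, the filter that keeps only terms over-represented in the elite set, and the toy documents are likewise assumptions.

```python
import math
from collections import Counter


def signature_by_log_likelihood(documents, topic, n_terms):
    """Rank terms of the elite set by log likelihood against the rest of the corpus."""
    elite = [doc for doc in documents if topic in doc]
    rest = [doc for doc in documents if topic not in doc]
    elite_freq = Counter(term for doc in elite for term in doc)
    rest_freq = Counter(term for doc in rest for term in doc)
    elite_size = sum(elite_freq.values())
    rest_size = sum(rest_freq.values())
    total = elite_size + rest_size
    scores = {}
    for term, o1 in elite_freq.items():
        if term == topic:
            continue
        o2 = rest_freq[term]
        # Expected counts: row total * column total / grand total.
        e1 = (o1 + o2) * elite_size / total
        e2 = (o1 + o2) * rest_size / total
        # Assumption: keep only terms that are over-represented in the elite set.
        if o1 <= e1:
            continue
        ll = o1 * math.log(o1 / e1)
        if o2 > 0:  # a zero cell contributes nothing
            ll += o2 * math.log(o2 / e2)
        scores[term] = 2.0 * ll
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n_terms]


# Toy tokenised documents, invented for illustration only.
docs = [
    ["network", "ip", "address", "ip"],
    ["network", "ip", "layer", "the"],
    ["printer", "paper", "the", "the"],
    ["printer", "ink", "the", "address"],
]
result = signature_by_log_likelihood(docs, "network", 5)
print([term for term, _ in result])  # ['ip', 'layer']
```

Without the over-representation filter the statistic would also rank terms that are unusually rare in the elite set, since the log likelihood ratio measures the size of a deviation, not its direction.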
15. Evaluation of signatures
- Evaluation of topic terms
- A domain expert evaluated the topics generated. Each topic term was rated Good, Acceptable or Bad.
- Evaluation of signature terms
- The topic signatures generated were evaluated manually by a domain expert, who assigned each a score (Good, Acceptable or Bad).
16. Topic signatures created
TS = {network, <(lab, 259.74), (ip, 250.89), (cable, 248.95), (address, 230.4), (layer, 215), (ethernet, 202.65), (routing, 186.7), (tcp, 173.86), (subnet, 169.56), (addresses, 165.21), (bits, 151.79), (frame, 143.84), (protocol, 142.14), (data, 141.39), (class, 138.81), (model, 137.82), (activity, 137.38), (host, 132.59), (fiber, 131.24), (protocols, 130.42), (binary, 128.23), (internet, 127.69)>}
TS = {address, <(routing, 120.73), (bits, 119.2), (lab, 115.69), (subnet, 114.76), (frame, 102.18), (class, 96.49), (layer, 89.43), (ethernet, 85.99), (host, 83.64), (ip, 81.77), (addresses, 80.31), (data, 79.35), (peer, 79.04), (arp, 78.42), (server, 76.96), (switch, 75.87), (destination, 72.6), (router, 69.83), (bridge, 68.22), (devices, 67.92), (collision, 67.44), (broadcast, 67.06)>}
17. TF-IDF topic retrieval results
18. DFR topic retrieval results
19. Log likelihood topic retrieval results
20. Further work
- Include bi-grams and tri-grams in the topic terms.
- Further evaluate the topics and signatures created by using them as background knowledge within a natural language processing application.
21. References
- Biryukov, M., Angheluta, R., Moens, M. 2005. Multidocument Question Answering Text Summarization Using Topic Signatures. Journal on Digital Information Management (JDIM), Volume 3, Issue 1.
- Chin-Yew Lin, Eduard Hovy. 2000. The Automated Acquisition of Topic Signatures for Text Summarization. 18th International Conference on Computational Linguistics.
- Agirre, E. et al. 2001. Enriching WordNet Concepts with Topic Signatures. Workshop on WordNet of the NAACL01 Conference.
- Wang, X. 2004. Automatic Acquisition of English Topic Signatures Based on a Second Language. Proceedings of the Student Research Workshop at ACL 2004.