Generating topic signatures from a corpus of e-learning content

Provided by: osirisSun
Transcript and Presenter's Notes

1
Generating topic signatures from a corpus of
e-learning content.
  • Marios N. Stamoulos
  • Chris Bowerman
  • Michael P. Oakes
  • University of Sunderland

2
Outline
  • Topic signatures: definition
  • Statistical models used in our work
  • Building topic signatures automatically using
    e-learning material
  • Evaluation of the topic signature building
    algorithm
  • Further work

3
Topic Signatures: Definition
  • A topic signature is a set of topic words and
    other words semantically related to them that
    together uniquely identify the topic (Chin-Yew
    Lin, 1997).
  • Formally, a topic signature is defined as
    $TS = \{topic, signature\} = \{topic, \langle (w_1, s_1), \ldots, (w_i, s_i) \rangle\}$
  • Here topic is the target concept under
    investigation and signature is a vector of
    related terms $w_j$, each with a strength $s_j$,
    that are highly correlated with the topic.
  • The strength of each associated word can be
    assigned automatically using statistical models.
  • The number of terms contained in the topic
    signature is set empirically.
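
As a minimal sketch of the definition above, a topic signature can be held as a topic string plus a list of (term, strength) pairs. The class name `TopicSignature` and the method `top` are illustrative assumptions, not part of the presented system; the weights are copied from the network example in this presentation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TopicSignature:
    """A topic plus a weighted vector of related terms (w_1, s_1), ..., (w_i, s_i)."""
    topic: str
    terms: List[Tuple[str, float]] = field(default_factory=list)

    def top(self, n: int) -> List[Tuple[str, float]]:
        """Return the n strongest terms; n is the empirically chosen cut-off."""
        return sorted(self.terms, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical usage with three of the weights from the network example.
ts = TopicSignature(
    "network",
    [("address", 230.39), ("ip", 250.88), ("protocol", 142.13)],
)
print(ts.top(2))
```

Keeping the cut-off `n` as a parameter reflects the point that the signature length is chosen empirically rather than fixed by the definition.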

4
Topic Signatures: Example
  • "You shall know a word by the company it keeps"
    (Firth, 1957).
  • Network: a system of interconnected electronic
    components or circuits (WordNet, sense 5)
  • TS = {network, <(address, 230.39), (ip, 250.88),
    (protocol, 142.13), (layer, 214.99),
    (ethernet, 202.64)>}

5
Topic signatures: Past applications
  • Topic summarisation
  • Enriching WordNet (ontology learning)
  • Word sense disambiguation
  • Question answering, e.g. personal profiles
    ("Who is X?")

6
Topic Signatures: Why do we need them? (Biryukov
et al., 2005)
  • They capture information at a level deeper than
    the word
  • They extend topic identification to the concept
    level without relying on knowledge bases or
    complex semantic parsers
  • They assume that semantically related terms tend
    to co-occur in the same context

7
Topic signature acquisition: Our method
  • Generate a corpus of Cisco training materials in
    computer networking
  • The full corpus generated is treated as a
    learning object.
  • Identify keywords (the most important words
    describing the topic of each text) using
    statistical methods (TF-IDF, DFR, Log Likelihood)
  • Get signature terms (co-occurring words) for each
    of the keywords.

8
Statistical Measures for topic extraction: TF-IDF
  • TF-IDF
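
A minimal sketch of TF-IDF keyword scoring, assuming the common formulation tf × log(N / df); the exact weighting variant used in the presented work is not given in this transcript. The function name `tf_idf` and the toy documents are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score every term in every document with tf * log(N / df).

    docs: list of token lists. Returns one {term: score} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores


# Hypothetical toy corpus of three tokenised documents.
docs = [
    ["ip", "address", "subnet", "ip"],
    ["cable", "fiber", "cable"],
    ["ip", "cable", "router"],
]
scores = tf_idf(docs)
for doc_scores in scores:
    # Keyword candidate: the highest-scoring term of the document.
    print(max(doc_scores, key=doc_scores.get))
```

Terms that occur in every document receive a score of zero under this formulation, which is why frequent but uninformative words drop out of the keyword list.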

9
Statistical Measures for topic extraction: DFR
  • Divergence from randomness model

10
Statistical Measures: Log Likelihood (G-Test) I
  • Used the following contingency table to extract
    terms that contain topic information about each
    document in the learning object.

11
Statistical Measures for topic extraction: Log
Likelihood (G-Test) II

12
Finding signature words using TF-IDF
  • Create a sub-corpus, the elite set: those
    documents containing the topic keyword(s).
  • Assign a weight to each term in the elite set
    using the TF-IDF measure. This identifies
    terms which co-occur with the topic term.
  • Variant 1 (worked best): take the term frequency
    from the sub-corpus and compute IDF over the
    sub-corpus.
  • Variant 2 (worked less well): take the total term
    frequency in the learning object and compute IDF
    over the whole corpus.
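
The elite-set procedure above can be sketched as follows, for the variant reported as working best (term frequency and IDF both taken from the sub-corpus). This is a hedged illustration, assuming the common tf × log(N / df) weighting; the function name `signature_by_tf_idf`, the `top_n` parameter and the toy documents are assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def signature_by_tf_idf(docs, topic, top_n=5):
    """Rank words co-occurring with `topic`, with TF-IDF computed on the elite set only."""
    # Elite set: the documents that contain the topic keyword.
    elite = [doc for doc in docs if topic in doc]
    if not elite:
        return []
    # Term frequency and document frequency, both within the sub-corpus.
    tf = Counter(term for doc in elite for term in doc)
    df = Counter(term for doc in elite for term in set(doc))
    scores = {
        term: tf[term] * math.log(len(elite) / df[term])
        for term in tf
        if term != topic
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical toy corpus; the last document falls outside the elite set.
docs = [
    ["network", "ip", "address", "ip"],
    ["network", "cable", "ip"],
    ["network", "cable", "fiber", "cable"],
    ["lab", "activity"],
]
signature = signature_by_tf_idf(docs, "network")
print(signature)
```

With IDF computed on the elite set, any term present in every elite document scores zero, so this sketch favours terms that are frequent in some, but not all, of the documents mentioning the topic.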

13
Finding signature words using Log likelihood
(G-test) I
  • Create an elite set of documents that contain the
    topic term
  • Create a contingency table like the one below for
    each term in the elite set.

14
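Finding signature words using Log likelihood
(G-test) II
  • The observed frequencies in the formula are
  • $O_1 = f_{t,d}$ (frequency of the term in the
    elite set)
  • $O_2 = f_{t,\bar{d}}$ (frequency of the term in
    the rest of the learning object)
  • The expected frequencies are calculated using
  • $E_1 = (f_{t,d} + f_{t,\bar{d}})(f_{t,d} + f_{\bar{t},d}) / (f_{t,d} + f_{t,\bar{d}} + f_{\bar{t},d} + f_{\bar{t},\bar{d}})$
  • $E_2 = (f_{t,d} + f_{t,\bar{d}})(f_{t,\bar{d}} + f_{\bar{t},\bar{d}}) / (f_{t,d} + f_{t,\bar{d}} + f_{\bar{t},d} + f_{\bar{t},\bar{d}})$
  • Here $\bar{t}$ denotes all terms other than the
    candidate term and $\bar{d}$ the rest of the
    learning object outside the elite set.
  • The log likelihood ratio is then calculated from
    these observed and expected frequencies.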
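
A minimal sketch of the calculation for one candidate term, assuming the standard two-term G statistic, G = 2 (O1 ln(O1/E1) + O2 ln(O2/E2)); the slide's own formula is not reproduced in this transcript. The function name and argument names are illustrative, and the four arguments are assumed to be the four cells of the contingency table (term or other terms, elite set or rest of the learning object).

```python
import math


def log_likelihood(f_t_d, f_t_rest, f_other_d, f_other_rest):
    """Two-term G statistic for one candidate term.

    f_t_d        frequency of the term in the elite set (O1)
    f_t_rest     frequency of the term in the rest of the learning object (O2)
    f_other_d    frequency of all other terms in the elite set
    f_other_rest frequency of all other terms in the rest of the learning object
    """
    total = f_t_d + f_t_rest + f_other_d + f_other_rest
    term_total = f_t_d + f_t_rest
    # Expected frequencies: row total times column total, divided by the grand total.
    e1 = term_total * (f_t_d + f_other_d) / total
    e2 = term_total * (f_t_rest + f_other_rest) / total
    g = 0.0
    for observed, expected in ((f_t_d, e1), (f_t_rest, e2)):
        if observed > 0:  # a zero observed count contributes nothing
            g += observed * math.log(observed / expected)
    return 2.0 * g


# Hypothetical counts: the term is over-represented in the elite set.
score = log_likelihood(50, 10, 950, 1990)
print(round(score, 2))
```

A term distributed across the elite set and the rest of the learning object in proportion to their sizes scores zero, so higher values indicate a stronger association with the topic term. The sketch does not guard against an empty table (a grand total of zero).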

15
Evaluation of signature
  • Evaluation of topic terms
  • A domain expert evaluated the topics generated.
    Each topic term was rated Good, Acceptable or
    Bad.
  • Evaluation of signature terms
  • The topic signatures generated were evaluated
    manually by a domain expert, who assigned each a
    score (Good, Acceptable, Bad).

16
Topic Signature created
TS = {network, <(lab, 259.74), (ip, 250.89), (cable, 248.95),
(address, 230.4), (layer, 215), (ethernet, 202.65),
(routing, 186.7), (tcp, 173.86), (subnet, 169.56),
(addresses, 165.21), (bits, 151.79), (frame, 143.84),
(protocol, 142.14), (data, 141.39), (class, 138.81),
(model, 137.82), (activity, 137.38), (host, 132.59),
(fiber, 131.24), (protocols, 130.42), (binary, 128.23),
(internet, 127.69)>}
TS = {address, <(routing, 120.73), (bits, 119.2), (lab, 115.69),
(subnet, 114.76), (frame, 102.18), (class, 96.49),
(layer, 89.43), (ethernet, 85.99), (host, 83.64),
(ip, 81.77), (addresses, 80.31), (data, 79.35),
(peer, 79.04), (arp, 78.42), (server, 76.96),
(switch, 75.87), (destination, 72.6), (router, 69.83),
(bridge, 68.22), (devices, 67.92), (collision, 67.44),
(broadcast, 67.06)>}

17
TF-IDF topic retrieval results

18
DFR topic retrieval results

19
Log Likelihood topic retrieval results

20
Further work
  • Include bi-grams and tri-grams in the topic terms
  • Further evaluate the topics and signatures
    created by using them as background knowledge
    within a natural language processing application.
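
As a small illustration of the first item, candidate bi-grams and tri-grams can be generated from a token list with a sliding window. This is only a sketch of the extraction step; the function name `ngrams` and the sample tokens are assumptions, and how such n-grams would be weighted is left open here.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical token sequence from a networking text.
tokens = ["ip", "address", "subnet", "mask"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)   # ['ip address', 'address subnet', 'subnet mask']
print(trigrams)  # ['ip address subnet', 'address subnet mask']
```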

21
References
  • Biryukov, M., Angheluta, R., Moens, M. 2005.
    Multidocument Question Answering Text
    Summarization Using Topic Signatures. Journal on
    Digital Information Management (JDIM), Volume 3,
    Issue 1.
  • Lin, C.-Y., Hovy, E. 2000. The Automated
    Acquisition of Topic Signatures for Text
    Summarization. 18th International Conference on
    Computational Linguistics.
  • Agirre, E. et al. 2001. Enriching WordNet
    Concepts with Topic Signatures. Workshop on
    WordNet of the NAACL01 Conference.
  • Wang, X. 2004. Automatic Acquisition of English
    Topic Signatures Based on a Second Language.
    Proceedings of the Student Research Workshop at
    ACL 2004.