Supporting e-learning with automatic glossary extraction: Experiments with Portuguese

1
Supporting e-learning with automatic
glossary extraction: Experiments with Portuguese
  • Rosa Del Gaudio, António Branco
  • RANLP, Borovets 2007

2
Presentation Plan
  • LT4eL project
  • ILIAS
  • Corpus
  • Tool
  • Grammars: Copula, Other Verbs, Punctuation
  • Results
  • Conclusion

3
LT4eL
  • Improve retrieval and accessibility of learning
    objects (LO) in learning management systems
  • Employ language technology resources and tools
    for the semi-automatic generation of descriptive
    metadata
  • Develop new functionalities, such as a keyword
    extractor, a glossary candidate detector and
    semantic search, tuned for the various languages
    addressed in the project (Bulgarian, Czech,
    Dutch, English, German, Maltese, Polish,
    Portuguese, Romanian)

4
ILIAS
5
Objective
  • Build a glossary automatically to support the
    e-learning process. In practice, this means
    extracting definitions from unstructured text
    (scientific papers, encyclopedias, web pages)
  • Better access to information for students
  • Speed up the work of the tutor

6
ILIAS Glossary Candidate Detector
7
The Corpus
  • 274,000 tokens
  • Document types: tutorials, PhD theses,
    scientific papers
  • 3 domains, evenly represented: e-learning,
    technology for non-experts, Calimera

8
XML format
  • <definingText continue="y" def="m147" def_type1="is_def" id="d5">
  • <markedTerm dt="y" id="m147" kw="y">
  • <tok base="intranet" class="word" ctag="PNM" id="t9032" sp="y">Intranet</tok>
  • </markedTerm>
  • <tok base="ser" class="word" ctag="V" id="t9033" msd="pi-3s" sp="y">é</tok>
  • <tok base="uma" class="word" ctag="UM" id="t9034" msd="fs" sp="y">uma</tok>
  • <tok base="rede" class="word" ctag="CN" id="t9035" msd="fs" sp="y">rede</tok>
  • <tok base="desenvolver,desenvolvido" class="word" ctag="PPA" id="t9036" msd="fs" sp="y">desenvolvida</tok>
  • <tok base="para" class="word" ctag="PREP" id="t9037" sp="y">para</tok>
  • <tok base="processamento" class="word" ctag="CN" id="t9038" msd="ms" sp="y">processamento</tok>
  • <tok base="de" class="word" ctag="PREP" id="t9039" sp="y">de</tok>
  • <tok base="informação" class="word" ctag="CN" id="t9040" msd="fp" sp="y">informações</tok>
  • <tok base="em" class="word" ctag="PREP" id="t9041" sp="y">em</tok>
  • <tok base="uma" class="word" ctag="UM" id="t9042" msd="fs" sp="y">uma</tok>
  • <tok base="empresa" class="word" ctag="CN" id="t9043" msd="fs" sp="y">empresa</tok>
  • <tok base="ou" class="word" ctag="CJ" id="t9044" sp="y">ou</tok>
  • <tok base="organização" class="word" ctag="CN" id="t9045" msd="fs">organização</tok>
  • <tok class="punctuation" ctag="PNT" id="t9046" sp="y">.</tok>
  • </definingText>
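
A minimal sketch of how a sentence annotated in this format could be read back, using Python's standard `xml.etree.ElementTree`. The shortened sample, the function name `read_definition`, and the reading of `sp="y"` as "followed by a space" are assumptions made for illustration; they are not part of the LT4eL tools.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the corpus format (illustrative, not the full sentence).
SAMPLE = """
<definingText def="m147" def_type1="is_def" id="d5">
  <markedTerm dt="y" id="m147" kw="y">
    <tok base="intranet" class="word" ctag="PNM" id="t9032" sp="y">Intranet</tok>
  </markedTerm>
  <tok base="ser" class="word" ctag="V" id="t9033" msd="pi-3s" sp="y">é</tok>
  <tok base="uma" class="word" ctag="UM" id="t9034" msd="fs" sp="y">uma</tok>
  <tok base="rede" class="word" ctag="CN" id="t9035" msd="fs">rede</tok>
  <tok class="punctuation" ctag="PNT" id="t9046" sp="y">.</tok>
</definingText>
"""


def read_definition(xml_text):
    """Return (definition type, defined term, plain-text sentence)."""
    root = ET.fromstring(xml_text.strip())
    # The defined term is the token sequence wrapped in <markedTerm>.
    term = " ".join(tok.text for tok in root.findall("./markedTerm/tok"))
    pieces = []
    for tok in root.iter("tok"):  # document order, nested tokens included
        pieces.append(tok.text)
        if tok.get("sp") == "y":  # assumption: sp="y" marks a following space
            pieces.append(" ")
    return root.get("def_type1"), term, "".join(pieces).strip()


if __name__ == "__main__":
    print(read_definition(SAMPLE))
```

Keeping the term and the definition type as attributes of the sentence makes it possible to compare automatic output with the manual annotation sentence by sentence.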

9
LxTransduce
  • Match tree using elements
  • Quick
  • Unicode friendly
  • Freeware
  • Easy to integrate into other tools (Java)
  • Input: plain text or XML
  • Regular expressions
  • Substitution and markup
  • Output: the same file with the changes applied

10
Rules in lxtransduce
  • <rule name="Conj">
  • <query match="tok[@ctag = 'CJ']"/>
  • </rule>
  • <rule name="Coor"> <!-- Conjunctions or comma -->
  • <first>
  • <query match="tok[. = ',']"/>
  • <ref name="Conj" mult=""/>
  • </first>
  • </rule>
  • <rule name="PARopen">
  • <query match="tok[.~'\(']"/>
  • </rule>
  • <rule name="PARcl">
  • <query match="tok[.~'\)']"/>
  • </rule>
  • <rule name="parenthetic">
  • <seq>
  • <ref name="PARopen"/>
  • <repeat-until name="tok">
  • <ref name="PARcl"/>
  • </repeat-until>
  • <ref name="PARcl"/>
  • </seq>
  • </rule>
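
The logic of the "parenthetic" rule can be sketched in plain Python: an opening parenthesis, any tokens up to the closing parenthesis, then the closing parenthesis itself. This is only an illustration of the matching strategy, assuming PARcl is meant to match the closing parenthesis; the function name `match_parenthetic` is hypothetical and the code does not use lxtransduce.

```python
def match_parenthetic(tokens, start):
    """Return the index just past a parenthetic beginning at `start`, or None.

    Mirrors the sequence PARopen, repeat-until PARcl, PARcl.
    """
    if start >= len(tokens) or tokens[start] != "(":
        return None  # PARopen does not match here
    i = start + 1
    # repeat-until: consume tokens until the closing parenthesis is seen
    while i < len(tokens) and tokens[i] != ")":
        i += 1
    if i == len(tokens):
        return None  # no closing parenthesis, so the rule fails
    return i + 1  # position after PARcl


if __name__ == "__main__":
    toks = ["HTML", "(", "HyperText", "Markup", "Language", ")", "é"]
    print(match_parenthetic(toks, 1))
```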

11
First development phase
  • Less than 50% of the corpus
  • Focus on the verb
  • Precision = correct automatic / all automatic
  • Recall = correct automatic / manually marked
  • F2 = 3 × (precision × recall) / (2 × precision + recall)
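
A minimal sketch of these three measures in Python, reading the F2 line as 3 × (precision × recall) / (2 × precision + recall). The function names and the counts in the example are hypothetical and are not results from the experiments.

```python
def precision(correct, extracted):
    """Correct automatic definitions over all automatic definitions."""
    return correct / extracted if extracted else 0.0


def recall(correct, marked):
    """Correct automatic definitions over manually marked definitions."""
    return correct / marked if marked else 0.0


def f2(p, r):
    """F2 as on the slide: 3 * (P * R) / (2 * P + R)."""
    denominator = 2 * p + r
    return 3 * p * r / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical counts: 100 extracted, 40 manually marked, 30 in common.
    p = precision(30, 100)
    r = recall(30, 40)
    print(p, r, f2(p, r))
```

With this weighting, a drop in recall lowers the score more than an equal drop in precision, which suits a tool whose output is later filtered by a tutor.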

12
Second development phase
  • 75% of the corpus for development
  • 25% of the corpus for testing
  • Specific grammar/rules for each type

13
Copula baseline grammar
  • Verb "to be", third person singular or plural,
    present indicative
  • <rule name="SERdef">
  • <best>
  • <ref name="Ser3"/>
  • <ref name="PoderSer"/>
  • </best>
  • </rule>
  • <rule name="euristic">
  • <seq>
  • <repeat-until name="tok">
  • <ref name="SERdef" mult=""/>
  • </repeat-until>
  • <ref name="SERdef" mult=""/>
  • <not>
  • <ref name="PPA"/>
  • </not>
  • <ref name="tok" mult=""/>
  • <end/>
  • </seq>
  • </rule>
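
The baseline heuristic can be approximated in a few lines of Python: skip tokens until the first third-person present-indicative form of "ser", then reject the sentence if the next token is a past participle (PPA). This is a sketch under assumptions: tokens are plain dictionaries, the PoderSer alternative is left out because its definition is not shown, and the helper names are hypothetical.

```python
def tok(form, base, ctag, msd=""):
    """Build a token record with the attributes used in the corpus."""
    return {"form": form, "base": base, "ctag": ctag, "msd": msd}


def is_ser3(token):
    """Verb 'ser', third person singular or plural, present indicative."""
    return (token["ctag"] == "V" and token["base"] == "ser"
            and token["msd"].startswith("pi-3"))


def is_copula_candidate(tokens):
    """True if the first 'ser' form is not directly followed by a PPA."""
    for i, token in enumerate(tokens):
        if is_ser3(token):
            following = tokens[i + 1] if i + 1 < len(tokens) else None
            return following is None or following["ctag"] != "PPA"
    return False  # no copula verb in the sentence


if __name__ == "__main__":
    sentence = [
        tok("Intranet", "intranet", "PNM"),
        tok("é", "ser", "V", "pi-3s"),
        tok("uma", "uma", "UM", "fs"),
        tok("rede", "rede", "CN", "fs"),
    ]
    print(is_copula_candidate(sentence))
```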

14
Copula baseline results
  • Sentence-level results
  • Problem with precision

15
Copula Grammar
16
Rules for is_type
  • <!-- To Be 3rd person pl and s -->
  • <rule name="Serdef">
  • <query match="tok[@ctag = 'V' and @base = 'ser' and
    (@msd[starts-with(., 'fi-3')] or @msd[starts-with(., 'pi-3')])]"/>
  • </rule>
  • ....
  • <rule name="copula1">
  • <seq>
  • <ref name="SERdef"/>
  • <best>
  • <seq>
  • <ref name="Art"/>
  • <ref name="adjadvprep" mult=""/>
  • <ref name="Noun" mult=""/>
  • </seq>
  • ....
  • </best>
  • <ref name="tok" mult=""/>
  • <end/>
  • </seq>
  • </rule>
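
The first alternative of "copula1" (the verb, then an article, modifiers and a noun) can be sketched over a list of part-of-speech tags. The multiplicities are not legible in the rule, so this sketch assumes zero or more modifiers and at least one noun. The tags UM, CN, PNM and PREP appear in the corpus sample; DA, ADJ and ADV are assumed tag names, and `matches_copula1` is a hypothetical helper.

```python
ARTICLES = {"UM", "DA"}             # DA (definite article) is an assumed tag
MODIFIERS = {"ADJ", "ADV", "PREP"}  # ADJ and ADV are assumed tags
NOUNS = {"CN", "PNM"}


def matches_copula1(ctags, verb_index):
    """Check for Art, modifiers (zero or more), Noun right after the verb."""
    i = verb_index + 1
    if i >= len(ctags) or ctags[i] not in ARTICLES:
        return False
    i += 1
    while i < len(ctags) and ctags[i] in MODIFIERS:
        i += 1
    return i < len(ctags) and ctags[i] in NOUNS


if __name__ == "__main__":
    # "Intranet é uma rede desenvolvida ..." as a tag sequence
    print(matches_copula1(["PNM", "V", "UM", "CN", "PPA"], 1))
```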

17
Comparing Results
Include the patterns that were excluded. Try to
gather the syntactic patterns of non-definitions
and compare them with the syntactic patterns of
definitions.
18
Other_Verbs grammar
  • <rule name="Vpas">
  • <seq>
  • <ref name="tok"/>
  • <not>
  • <ref name="not"/>
  • </not>
  • <ref name="tok" mult="?"/>
  • <query match="tok[mylex(@base) and (@ctag = 'PPA')]"
    constraint="mylex(@base)/cat = 'pas'"/>
  • </seq>
  • </rule>
  • Collect verbs in a lexicon
  • Three different categories: reflexive, active,
    passive
  • 22 different verbs
  • <lex word="chamar">
  • <cat>ref</cat>
  • </lex>
  • <lex word="chamar,chamado">
  • <cat>pas</cat>
  • </lex>
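
The lexicon check in the "Vpas" query can be sketched as a dictionary lookup: a token qualifies when it is tagged as a past participle and its base form is listed with the category "pas". The two entries are the ones shown on the slide; the dictionary layout and the function name are assumptions for illustration.

```python
# Base form -> set of categories, mirroring the <lex>/<cat> entries.
LEXICON = {
    "chamar": {"ref"},
    "chamar,chamado": {"pas"},
}


def is_passive_definition_verb(base, ctag):
    """Token is a past participle whose base is listed as 'pas'."""
    return ctag == "PPA" and "pas" in LEXICON.get(base, set())


if __name__ == "__main__":
    print(is_passive_definition_verb("chamar,chamado", "PPA"))
```

Keeping the verbs in a lexicon rather than in the rules means that a new definitional verb only requires a new entry, not a new rule.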

19
Results for verb_type
  • Analyze each verb separately, as with is_type
  • Richer syntactic patterns

20
Punctuation Grammar
  • <rule name="punct_def">
  • <seq>
  • <start/>
  • <ref name="CompmylexSN" mult=""/>
  • <query match="tok[.~'\:']"/>
  • <ref name="tok" mult=""/>
  • <end/>
  • </seq>
  • </rule>
  • Preliminary work
  • Definitions introduced by a colon (the most
    frequent case)
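
A rough Python sketch of the colon pattern: the sentence starts with a noun-phrase-like sequence, followed by a colon, followed by the rest of the sentence. The definition of CompmylexSN is not shown, so the set of tags allowed before the colon is an assumption, as are the function name and the example sentence.

```python
# Tags allowed in the term before the colon (assumed stand-in for CompmylexSN).
TERM_TAGS = {"DA", "UM", "CN", "PNM", "ADJ", "PREP"}


def split_colon_definition(tokens):
    """tokens: list of (form, ctag). Return (term, definition) or None."""
    forms = [form for form, _ in tokens]
    if ":" not in forms:
        return None
    colon = forms.index(":")
    head, tail = tokens[:colon], tokens[colon + 1:]
    if not head or not tail:
        return None  # nothing before or after the colon
    if any(ctag not in TERM_TAGS for _, ctag in head):
        return None  # material before the colon is not a plain noun phrase
    term = " ".join(form for form, _ in head)
    definition = " ".join(form for form, _ in tail)
    return term, definition


if __name__ == "__main__":
    sentence = [("Intranet", "PNM"), (":", "PNT"),
                ("rede", "CN"), ("interna", "ADJ")]
    print(split_colon_definition(sentence))
```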

21
All-in-one
  • Combination of the previous grammars
  • The definition type is not taken into account
    when calculating precision and recall
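
A sketch of how such a type-independent evaluation could be computed: the sentences proposed by the individual grammars are merged, and a proposed sentence counts as correct if it is manually marked as a definition of any type. The sentence ids, the type labels other than "is_def", and the function name are hypothetical.

```python
def evaluate_all_in_one(gold, predicted_by_grammar):
    """Return (precision, recall), ignoring the definition type.

    gold: dict mapping sentence id -> manually assigned definition type.
    predicted_by_grammar: dict mapping grammar name -> set of sentence ids.
    """
    predicted = set().union(*predicted_by_grammar.values())
    correct = predicted & set(gold)  # the type labels are not compared
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    gold = {"d1": "is_def", "d2": "verb_def", "d3": "punct_def"}
    predictions = {
        "copula": {"d1", "d2", "s9"},   # d2 found by the "wrong" grammar
        "other_verbs": {"s7"},
        "punctuation": {"d3"},
    }
    print(evaluate_all_in_one(gold, predictions))
```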

22
Conclusions and Future Work
  • Overall results: recall 86%, precision 14%
  • Differences among domains: the style of a
    document influences the results
  • Improve the rules for verb_type and punc_type
  • Combine with other techniques, such as machine
    learning (ML)