1
Design and Evaluation of a Versatile Architecture
for a Multilingual Word Prediction System
  • Sira E. Palazuelos Cagigas, José L. Martín
    Sánchez, Lisset Hierrezuelo Sabatela
  • Departamento de Electrónica. Universidad de
    Alcalá.
  • Alcalá de Henares. España
  • Javier Macías Guarasa
  • Dpto. de Ingeniería Electrónica. ETSI de
    Telecomunicación. UPM.
  • Madrid. España

ICCHP06
2
Overview
  • Introduction
  • Word prediction system
  • Description of the prediction systems for each
    language
  • Evaluation
  • Conclusions

3
Introduction (I)
  • Word prediction is the set of techniques that try
    to predict the word a person is typing
  • Examples

Text typed so far                        Words offered
La casa de los e                         el, es, este
La casa de los es                        este, esta, estado
La casa de los esp                       español, española, especial
La casa de los espí                      espíritu, espías, espía
La casa de los espír                     espíritu, espíritus
La casa de los espíritus

La casa de los e                         ensayos, elementos, estudios
La casa de los es                        estudios, espíritus, estados
La casa de los espíritus

La casa de los espíritus. Dichos e       espíritus, ensayos, elementos
La casa de los espíritus. Dichos espíritus
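The simplest behaviour in the examples above, offering the most frequent words that start with the letters typed so far, can be sketched as follows. This is only an illustration, not the system's implementation: the function name `predict`, the choice of three proposals, and the word counts are all assumptions invented for this example.

```python
from collections import Counter

def predict(prefix, frequencies, k=3):
    """Return up to k words that extend `prefix`, most frequent first."""
    # Skip the prefix itself: offering a word that is already fully typed
    # saves no keystrokes (an assumption of this sketch).
    candidates = [(w, c) for w, c in frequencies.items()
                  if w.startswith(prefix) and w != prefix]
    # Sort by descending count, then alphabetically so ties are deterministic.
    candidates.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in candidates[:k]]

# Toy counts, invented for illustration only.
toy_counts = Counter({
    "el": 90, "es": 80, "este": 60, "esta": 50, "estado": 30,
    "español": 20, "española": 18, "especial": 15,
    "espíritu": 5, "espías": 4, "espía": 3, "espíritus": 2,
})

for typed in ["e", "es", "esp", "espí", "espír"]:
    print(typed, "->", predict(typed, toy_counts))
```

With these toy counts the proposals narrow as each letter is typed, which is the effect the slide illustrates.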
4
Introduction (II)
  • Justification
  • People with physical disabilities
  • Computer access for writing or communication

5
WPS General architecture
  • Main features
  • Modularity: separation between information
    sources and prediction methods
  • Flexibility: task and language independent
  • Power: integration of multiple information sources

[Figure: architecture diagram; visible labels: Management module, Training module]
6
WPS Lexicons
  • Main lexicon
  • Word form and all the possible lemmas of each
    word.
  • Probabilistic information.
  • Grammatical information: POS and features.
  • Dynamic lexicons: subject and personal lexicons
  • User vocabulary (new words, proper names,
    specific vocabulary, etc.).
  • Frequencies dependent on the user and subject.
  • Word pairs.
  • Training from pre-stored texts (subject lexicons)
    or the current text (personal lexicon).
  • Problem: spelling mistakes.
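As a rough sketch of the two kinds of lexicon described on this slide, the following shows one possible shape for a main-lexicon entry and for a personal lexicon trained on the current text. The class names, field names, and the simple tokenizer are assumptions made for illustration; the slides do not specify the actual data structures.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class LexiconEntry:
    """One main-lexicon entry (field names are hypothetical)."""
    form: str                      # word form as typed
    lemmas: list                   # all possible lemmas of the form
    probability: float             # probabilistic information
    pos: str                       # part-of-speech tag
    features: dict = field(default_factory=dict)  # e.g. gender, number

class PersonalLexicon:
    """Dynamic lexicon: word and word-pair counts learned from text."""
    def __init__(self):
        self.word_counts = Counter()
        self.pair_counts = Counter()

    def train(self, text):
        # Naive tokenizer (assumption): lowercase runs of word characters.
        words = re.findall(r"\w+", text.lower())
        self.word_counts.update(words)
        # Pairs of adjacent words, counted across sentence boundaries too.
        self.pair_counts.update(zip(words, words[1:]))

entry = LexiconEntry("espíritus", ["espíritu"], 0.0001, "noun",
                     {"gender": "masc", "number": "plural"})
personal = PersonalLexicon()
personal.train("La casa de los espíritus. Dichos espíritus")
print(personal.word_counts["espíritus"])
print(personal.pair_counts[("los", "espíritus")])
```

Because such a lexicon learns whatever the user types, a misspelled word is stored like any other, which is the spelling-mistake problem the slide mentions.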

7
WPS Prediction methods
  • Word probabilistic grammars:
  • Unigrams, bigrams, trigrams.
  • POS probabilistic grammars:
  • Bipos, tripos.
  • Smoothing.
  • Fall back.
  • Basic feature management.
  • Stochastic context free grammar (SCFG):
  • Probabilistic information.
  • Rules may include optional symbols (each with
    its own probability).
  • Feature agreement, imposition and prohibition.
  • Word form and lemma prohibition and imposition.
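To make the word n-gram idea with fall back concrete, here is a minimal sketch that ranks candidates by bigram counts and then fills the remaining slots from unigram counts. It deliberately leaves out smoothing, POS grammars and the SCFG, and the function names and toy sentence are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def train(tokens):
    """Count unigrams and bigrams from a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        bigrams[prev][word] += 1
    return unigrams, bigrams

def predict_next(prev, prefix, unigrams, bigrams, k=3):
    """Rank words starting with `prefix`: bigram matches first, then unigrams."""
    def top(counter):
        matches = [(w, c) for w, c in counter.items() if w.startswith(prefix)]
        matches.sort(key=lambda wc: (-wc[1], wc[0]))
        return [w for w, _ in matches]

    ranked = top(bigrams.get(prev, Counter()))
    # Fall back to unigram counts for any slots the bigrams did not fill.
    for w in top(unigrams):
        if w not in ranked:
            ranked.append(w)
    return ranked[:k]

# Toy training text, invented for illustration only.
tokens = "la casa de los espíritus es el libro de los estudios y los espíritus".split()
unigrams, bigrams = train(tokens)
print(predict_next("los", "e", unigrams, bigrams))
print(predict_next("libro", "e", unigrams, bigrams))
```

After "los" the bigram counts favour plural nouns, while after a word with no matching bigram the ranking comes entirely from the unigram counts.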

8
WPS Heuristics
  • Elimination of the words previously rejected by
    the user.
  • Prediction of the most frequent word suffixes
    beginning with the last letter typed.
  • Automatic insertion of spaces after punctuation
    marks.
  • Automatic capitalization after a period.
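Three of the heuristics listed above are simple enough to sketch directly. The function names and regular expressions below are assumptions chosen for illustration; they treat any period followed by a letter as a sentence boundary, so they would mishandle text such as abbreviations or web addresses.

```python
import re

def filter_rejected(predictions, rejected):
    """Drop words the user has already rejected for the current word."""
    return [w for w in predictions if w not in rejected]

def tidy(text):
    """Insert spaces after punctuation and capitalize sentence starts."""
    # Add a space when punctuation is followed directly by a letter
    # ([^\W\d_] matches letters only, so "3.5" is left untouched).
    text = re.sub(r"([.,;:!?])(?=[^\W\d_])", r"\1 ", text)
    # Capitalize the first letter of the text and the first letter after a period.
    return re.sub(r"(^|\.\s+)([^\W\d_])",
                  lambda m: m.group(1) + m.group(2).upper(), text)

print(filter_rejected(["este", "esta", "estado"], {"esta"}))
print(tidy("la casa de los espíritus.dichos espíritus,sí"))
```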

9
WPS Management module
  • The management module:
  • Processes the input from the user interface (text
    written by the user).
  • Manages the information flow between the
    different prediction methods (coordinating the
    data each one needs and provides) and the
    transactions with the lexicons.
  • Obtains the word prediction list that each method
    provides and combines them to send the most
    adequate proposals to the user interface.
  • Applies the heuristics.
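The slides do not say how the management module merges the lists from the different prediction methods. One plausible scheme, shown here purely as an assumption, is a weighted sum of the scores each method assigns, followed by the rejected-word heuristic. The function name `combine`, the weights and the scores are all hypothetical.

```python
def combine(prediction_lists, rejected=(), k=3):
    """Merge (weight, [(word, score), ...]) lists into one ranked proposal list."""
    scores = {}
    for weight, predictions in prediction_lists:
        for word, score in predictions:
            # Weighted sum across methods (an assumed combination rule).
            scores[word] = scores.get(word, 0.0) + weight * score
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    # Apply the rejected-word heuristic before truncating to k proposals.
    return [w for w in ranked if w not in rejected][:k]

# Hypothetical outputs of two prediction methods for the same context.
ngram_list = (0.7, [("espíritus", 0.5), ("estudios", 0.3)])
lexicon_list = (0.3, [("estudios", 0.6), ("estados", 0.4)])

print(combine([ngram_list, lexicon_list]))
print(combine([ngram_list, lexicon_list], rejected={"estudios"}))
```

Keeping each method behind the same list-of-scored-words shape is one way to get the modularity the architecture aims for: a method can be added or removed without touching the others.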

10
WPS User interface
11
Description of the prediction systems for each
language
Heuristic / Lexicon / Word Prediction Algorithm   Spanish  English  Portug.  Swedish
Main lexicon ✓ ✓ ✓ ✓
Subject lexicon ✓ ✓ ✓ ✓
Personal lexicon ✓ ✓ ✓ ✓
Unigram ✓ ✓ ✓ ✓
2-grams to 6-grams ✓ ✓ ✓
Static bipos and tripos ✓ ✓
Features management ✓
SCFG ✓
Suffixes prediction ✓
Elimination of rejected words ✓ ✓ ✓ ✓
Auto capitalization ✓ ✓ ✓ ✓
Spaces after punct. marks ✓ ✓ ✓ ✓
12
Evaluation (I)
  • Results of the Spanish word prediction system
    (% of saved keystrokes)

Exp.  Configuration                                      Result (%)  Relative impr. (%)
1     Static bipos and tripos and features management    42.7
2     Exp. 1 plus 2-grams to 6-grams                     51.9        21.5
3     Exp. 2 plus personal lexicon                       53.3        2.7
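The slides do not define the relative improvement column, but the Spanish figures are reproduced by taking the gain over the previous experiment as a percentage of the previous result. A small check under that assumption (the function name is hypothetical):

```python
def relative_improvement(previous, current):
    """Gain of `current` over `previous`, as a percentage of `previous`."""
    return 100.0 * (current - previous) / previous

# Spanish results for experiments 1 to 3 (% of saved keystrokes).
spanish = [42.7, 51.9, 53.3]
for prev, cur in zip(spanish, spanish[1:]):
    print(f"{prev} -> {cur}: {relative_improvement(prev, cur):.1f}%")
```

Rounded to one decimal place, this gives 21.5 and 2.7, matching the table.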
13
Evaluation (II)
  • Results of the English word prediction system
    (% of saved keystrokes)

Exp.  Configuration                     Result (%)  Relative impr. (%)
1     Static bipos and tripos           28.2
2     Exp. 1 plus 2-grams to 6-grams    37.4        32.6
3     Exp. 2 plus personal lexicon      47.7        27.5
14
Evaluation (III)
  • Results of the Swedish word prediction system
    (% of saved keystrokes)

Exp.  Configuration                     Result (%)  Relative impr. (%)
1     Unigrams                          33.8
2     Exp. 1 plus 2-grams to 6-grams    42.7        26.3
3     Exp. 2 plus personal lexicon      47.7        11.7
15
Evaluation (IV)
  • Results of the Portuguese word prediction system
    (% of saved keystrokes)

Exp.  Configuration                     Result (%)  Relative impr. (%)
1     Unigrams                          38.2
2     Exp. 1 plus 2-grams to 6-grams    42.8        12.0
3     Exp. 2 plus personal lexicon      45.0        5.1
16
Evaluation (V)
  • The percentage of saved keystrokes is more than
    45%, and the percentage of words predicted before
    the user types all their letters is usually over
    90-95%, for all the languages, lexicons and methods.
  • The differences between the results are due to:
  • The number of information sources available for
    each language:
  • The grammatical information for Spanish has been
    specially designed to optimize the prediction
    process, while the grammatical information
    available for English was the one included in the
    BNC.
  • Agreement between the test and training texts:
  • If the subject of the test and training texts is
    the same, the prediction obtained by the n-gram
    based methods can be very good, leaving little
    room for improvement by the personal lexicon.
  • For the best-trained languages (Spanish and
    English), the results for experiment 3 are very
    similar, showing the power of the personal lexicon.
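The slides report saved keystrokes without spelling out how they are counted. A common way to measure it, used here only as an assumed typing model, is to compare the keys needed with prediction against typing every letter plus a space: the user types letters until the target word appears in the proposal list, then spends one key selecting it, and the space is inserted automatically. The function names and the toy vocabulary are hypothetical.

```python
def keystrokes_saved(words, predict, k=3):
    """Percentage of keystrokes saved under the assumed typing model."""
    total = 0   # keys without prediction: every letter plus a space
    typed = 0   # keys with prediction
    for word in words:
        total += len(word) + 1
        for i in range(len(word)):
            if word in predict(word[:i], k):
                typed += i + 1          # i letters plus one selection key
                break
        else:
            typed += len(word) + 1      # never offered: type it in full
    return 100.0 * (total - typed) / total

# Toy frequency-based predictor, invented for illustration only.
counts = {"la": 5, "de": 4, "casa": 3, "los": 2, "espíritus": 1}

def toy_predict(prefix, k):
    matches = sorted((w for w in counts if w.startswith(prefix)),
                     key=lambda w: (-counts[w], w))
    return matches[:k]

sentence = ["la", "casa", "de", "los", "espíritus"]
print(keystrokes_saved(sentence, toy_predict))
```

The toy figure is far higher than the reported results because the test sentence uses only words from a five-word vocabulary; it illustrates the counting, not realistic performance.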

17
Conclusions (I)
  • The architecture, lexicons and word prediction
    methods of a prediction system have been
    described.
  • The system architecture is:
  • Modular, with independent modules and
    well-defined interfaces between them.
  • Flexible: prediction methods or lexicons can
    easily be changed for the same or a different
    language.
  • The system has been evaluated for Spanish,
    English, Swedish and Portuguese, with more than
    45% of keystrokes saved and over 90% of words
    predicted.

18
Conclusions (II)
  • The results of the evaluation show that:
  • The architecture is able to efficiently handle
    different languages with similar performance
  • There are important differences when including
    additional information sources in the prediction
    process, when compared with the basic prediction
    methods.
  • The improvements strongly depend on the prior
    information (the availability of the grammatical
    information, features, the number of words in the
    main lexicon, etc.).
  • N-grams also produced results varying with the
    agreement between the test and training texts.
  • The use of flexible methods, like the personal
    and subject lexicon, produces the best results
    for all the languages, due to their capability to
    adapt to the new vocabulary and frequencies. They
    compensate for the shortcomings of the fixed lexicons.

19
Thank you
  • Thank you for your attention
  • For further information:
  • Email: sira_at_depeca.uah.es
  • PhD thesis with the explanation of the
    architecture:
  • http://www.depeca.uah.es/personal/sira/Documentos/thesisSiraEnglish.pdf
  • http://www.depeca.uah.es/personal/sira/Documentos/TraspasTesisIngles47.zip
  • Report: Report on Word Prediction for Spanish,
    English and Swedish
  • http://www.depeca.uah.es/personal/sira/Documentos/Report on Word Prediction.pdf
  • Papers on Word Prediction (in English or Spanish):
  • http://www.depeca.uah.es/personal/sira/