Evaluating Text-to-Speech Synthesis for use in Computer-Assisted Language Learning

1
Evaluating Text-to-Speech Synthesis for use in
Computer-Assisted Language Learning
  • Zöe Handley
  • LSRI, University of Nottingham
  • January 2008

2
Context
  • EU Project FreeText
  • An advanced hypermedia CALL system featuring NLP
    tools for a smart treatment of authentic
    documents and free production exercises (Hamel et
    al., 2000)

3
Plan
  • TTS synthesis in CALL
  • Evaluation
  • Requirements analysis
  • Validation of requirements
  • Readiness of TTS synthesis for CALL
  • Conclusions

4
What is TTS synthesis?
  • Speech synthesis
  • systems that allow the generation of novel
    messages, either from scratch (i.e. entirely by
    rule) or by re-combining shorter pre-stored
    units (van Bezooijen and van Heuven, 1997: 709)
  • Text-to-Speech Synthesis
  • The automatic generation of speech from text

5
What is Text-to-Speech Synthesis?
  • http://www.acapela-group.com/text-to-speech-interactive-demo.html

6
TTS synthesis: Why now?
  • The challenge of TTS synthesis
  • The man (and he certainly was one!) just said,
    "Maybe. I'll see. I can't promise."
    (McAllister, 1989)
  • Dr. Jones lives at 11 School Dr. and works on the
    corner of St. James St. (Dutoit, 1997)
  • Rough, through, bough, thought, dough, cough, and
    hiccough.
  • (Divay and Vitale, 1997)

A simple but general diagram of a TTS system
(Dutoit, 1997)
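
The "Dr." and "St." example above can be made concrete with a small text-normalisation sketch. This is only an illustration, not how any of the systems discussed here work: the function name `expand_abbreviations` and the positional heuristic (title before a capitalised word, street type otherwise) are assumptions made for the example.

```python
import re

def expand_abbreviations(text: str) -> str:
    """Expand 'Dr.' and 'St.' with a toy positional heuristic (an assumption)."""
    # Before a capitalised word, read the abbreviation as a title or saint.
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Whatever remains follows a name, so read it as a street type.
    text = re.sub(r"\bDr\.", "Drive", text)
    text = re.sub(r"\bSt\.", "Street", text)
    return text

sentence = ("Dr. Jones lives at 11 School Dr. and works on "
            "the corner of St. James St.")
print(expand_abbreviations(sentence))
# prints: Doctor Jones lives at 11 School Drive and works on the corner of Saint James Street
```

The heuristic handles this one sentence; real front ends need much richer context, which is part of why the task is described as a challenge.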
7
TTS synthesis: Why now? (cont.)
  • Parametric synthesis
  • Simulation of the acoustic signal
  • Formant synthesis
  • Concatenative synthesis
  • Concatenation of pre-recorded segments of natural
    human speech

The spectrogram for the word phonetician
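
As a rough illustration of concatenative synthesis, one common choice of pre-recorded segment is the diphone, which runs from the middle of one sound to the middle of the next. The sketch below only lists the units that would have to be fetched for an utterance. The function name `to_diphones`, the `_` silence marker and the phoneme symbols are assumptions made for the example.

```python
def to_diphones(phonemes):
    """Return the diphone units needed to concatenate an utterance."""
    padded = ["_"] + list(phonemes) + ["_"]  # "_" marks silence at the edges
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

# French "chat" written with two illustrative phoneme symbols.
print(to_diphones(["S", "a"]))
# prints: ['_-S', 'S-a', 'a-_']
```

A real system would then look up a recording for each unit and smooth the joins between them.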
8
TTS synthesis: Why now?
  • Formant synthesis
  • Concatenative synthesis

http://www.speaknspell.co.uk/speaknspell.html
http://www.acapela-group.com/text-to-speech-interactive-demo.html
9
Why TTS synthesis in CALL?
  • There is a general need in language learning and
    teaching for self-paced interactive learning
    environments which provide controlled
    interactive speaking practice outside the
    classroom (Ehsani and Knodt, 1998: 45).

10
CALL applications: Reading machine
  • Talking dictionary
  • Talking text
  • Talking word processor
  • Talking conjugator
  • Dictation
  • Grapheme-phoneme exercises

Oxford Hachette 4 French Dictionary on CD-ROM
11
CALL applications: Pronunciation model
  • Practice of individual and combined phonemes
  • Auditory discrimination
  • Repetition
  • Practice of intonation and prosody (the music of
    speech)
  • Auditory discrimination
  • Repetition

SAFexo (Hamel, 1998; 2003)
12
CALL applications: Dialogue partner
  • In combination with automatic speech recognition
    and speech understanding, the generative power of
    TTS synthesis can be harnessed to provide learners
    with interactive speaking practice, i.e. a
    dialogue partner

Mr Smoketoomuch Monty Python sketch (KTH, 1999)
http://www.speech.kth.se/
13
Benefits of TTS synthesis
  • Easy creation and editing of speech samples
  • Simultaneous presentation of text and speech
  • Low storage requirements
  • Non-human and therefore perceived as
    non-judgemental
  • Improves on possibilities other media provide,
    but does not add value, i.e. bring about new
    possibilities

14
Benefits of TTS synthesis
  • Generation of examples on demand (Sherwood, 1981)
    and therefore the automatic generation of
    feedback, conversational turns, and exercises
    with speech models
  • Adds value to CALL, i.e. brings about new
    possibilities such as provision of interactive
    conversations

15
Why evaluation?
  • Few CALL applications integrating TTS synthesis
    are available on the market
  • Since the failure of the language laboratory,
    teachers have been sceptical about unevaluated
    technologies
  • The most common role that TTS synthesis assumes
    outside CALL is that of a reading machine

16
Evaluation of TTS synthesis for CALL
  • CALL evaluation framework (Chapelle, 2001)
  • Judgemental evaluation of the application
  • Judgemental evaluation of the planned activity
  • Evaluation of learners' performance
  • Product oriented
  • Process oriented
  • Speech and Language Technology (SALT) evaluation
    framework (Paroubek and Blasband, 1999)
  • Basic research evaluation
  • Technology evaluation
  • Usage evaluation

17
Framework for the evaluation of TTS synthesis for
use in CALL
  • Level 1
  • Viability and potential benefits of the use of
    TTS synthesis in CALL
  • Level 2
  • Adequacy of TTS synthesis for use in CALL
  • Level 3
  • Potential of the CALL program to provide ideal
    conditions for SLA
  • Level 4
  • Potential of the planned activity to provide
    ideal conditions for SLA
  • Level 5
  • Learners' performance in the planned activity
  • Level 6
  • Success of the funding program

18
Evaluations of TTS Synthesis for CALL
  • Technology evaluations
  • Stratil et al (1987a)
  • Evaluated the quality of a Spanish TTS chip for
    use for the presentation of grammar exercises in
    a language laboratory.
  • Usage evaluations: outcome-oriented
  • Santiago-Oriola (1999)
  • Evaluated the use of a French TTS synthesiser for
    the presentation of dictation exercises.
  • Hincks (2002)
  • Evaluated the use of a Swedish TTS synthesiser in
    combination with a speech editor (re-synthesis)
    for teaching the lexical stress of English to
    Swedophones.
  • Usage evaluations: process-oriented
  • Cohen (1993)
  • Evaluated the use of a talking word processor to
    support literacy activities, namely writing
    stories, for young learners of French as a
    second language.
  • Impact evaluations
  • Stratil et al (1987b)
  • Evaluated user reactions to the use of a Spanish
    TTS chip for the presentation of grammar
    exercises in a language laboratory.

19
The evaluation process
  • ISO (1999) and EAGLES (1999) guidelines
  • Establish the evaluation requirements
  • Establish the purpose of the evaluation
  • Identify the types of products to be evaluated
  • Specify the quality model
  • Specify the evaluation
  • Select metrics
  • Establish rating levels for metrics
  • Establish criteria for assessment
  • Design the evaluation
  • Execute the evaluation

20
CALL requirements
  • When the language competence of the system
    begins to outstrip that of some of the better
    second language users, such systems become useful
    adjunct tools (Keller and Zellner-Keller, 2000)

21
CALL requirements analysis
  • Ideal conditions for Second Language Acquisition
    (SLA) (Chapelle, 2001)
  • Language learning potential
  • Goals of SLA
  • Communicative competence
  • Quality of the output
  • Primary requirement: Comprehensibility/intelligibility
  • Secondary requirements: Accuracy and naturalness
  • At both the level of individual speech sounds and
    the prosodic level
  • Focus on form
  • Flexibility
  • Speech rate, pitch

22
Explorative investigation
  • Research questions
  • Do the different roles identified impose
    different requirements on the quality of speech
    synthesis?
  • Does comprehensibility account for acceptability
    for use in CALL?

23
Design
  • Within subjects
  • N = 17, French teachers
  • Dependent variables
  • Comprehensibility
  • Acceptability
  • Appropriateness
  • Frequency and seriousness of errors
  • Independent variables
  • Role of TTS
  • Reading machine
  • Pronunciation model
  • Dialogue partner

24
Materials and apparatus
  • 1 research TTS system
  • FIPSVox, University of Geneva
  • Diphone concatenation
  • 20 utterances representative of each role
  • Reading machine,
  • Pronunciation model,
  • Dialogue partner
  • Questionnaires
  • Likert scales: Comprehension, acceptability,
    appropriateness
  • Word pointing paradigm (Van Santen, 1993)

25
Scales used
  • Comprehension and acceptability
  • Appropriateness

26
Word pointing paradigm
27
Results
  • Friedman test used to test for difference among
    roles
  • Overall Appropriateness
  • Significant at p < 0.05 (χ² = 12.182, df = 2, p = 0.002,
    two-tailed)
  • Overall Acceptability
  • Significant at p < 0.05 (χ² = 9.5, df = 2, p = 0.009,
    two-tailed)
  • Overall Comprehensibility
  • Significant at p < 0.05 (χ² = 18.667, df = 2, p < 0.001,
    two-tailed)
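
For readers unfamiliar with the Friedman test, the sketch below shows how the statistic is obtained from within-subject ratings, and how a statistic with 2 degrees of freedom converts to a p-value. The ratings are made up purely for illustration (they are not the study's data), the function names are assumptions, and no tie correction is applied. The last lines check the three statistics reported above (12.182, 9.5 and 18.667).

```python
import math

def average_ranks(values):
    """Rank values within one rater; tied values share the mean rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def friedman_chi2(ratings):
    """Friedman statistic (no tie correction); rows = raters, columns = roles."""
    n, k = len(ratings), len(ratings[0])
    rank_sums = [0.0] * k
    for row in ratings:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3 * n * (k + 1)

def chi2_p_df2(chi2):
    """Upper-tail p-value of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-chi2 / 2)

# Hypothetical ratings: one row per teacher; columns are reading machine,
# pronunciation model and dialogue partner.
ratings = [
    [3, 2, 4],
    [4, 2, 5],
    [3, 1, 4],
    [2, 3, 4],
]
stat = friedman_chi2(ratings)
print(stat, round(chi2_p_df2(stat), 3))

# Consistency check on the statistics reported on the slide.
for reported in (12.182, 9.5, 18.667):
    print(reported, round(chi2_p_df2(reported), 4))
```

With three roles the test has 2 degrees of freedom, which is why the closed form `exp(-chi2 / 2)` is enough here.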

28
Results: Relationship between comprehensibility
and acceptability
  • Spearman's rho used to test for correlation
  • Reading: Positively and strongly related
    (rho = 0.793, N = 12, p = 0.001, one-tailed)
  • Pronunciation: Positively related, but not
    strongly (rho = 0.547, N = 12, p = 0.033, one-tailed)
  • Conversation: Positively related, but not
    strongly (rho = 0.504, N = 12, p = 0.047, one-tailed)
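
Spearman's rho is the Pearson correlation computed on ranks rather than raw scores, which suits ordinal Likert ratings. The sketch below shows the calculation on made-up ratings. The data and function names are assumptions for illustration only and do not reproduce the study's figures.

```python
def average_ranks(values):
    """Rank values from 1 upwards; tied values share the mean rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-rater scores for one role.
comprehensibility = [1, 2, 3, 4, 5]
acceptability = [2, 1, 4, 3, 5]
rho = spearman_rho(comprehensibility, acceptability)
print(round(rho, 3))
```

A rho near 1 means raters who found the speech more comprehensible also tended to find it more acceptable. The moderate values for the pronunciation and conversation roles suggest that comprehensibility alone does not settle acceptability.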

29
Results (cont.)
  • Types of errors highlighted
  • Accuracy: bad segments, bad words, bad phrasing,
    inappropriate intonation, bad sentence stress,
    inappropriate rhythm
  • Naturalness: exaggerated intonation
  • Register (formality): inappropriate
    dropping/retention of schwa, inappropriate
    omission/insertion of liaison
  • Expressiveness: lacked emotion
  • Most frequent errors
  • Inappropriate intonation, inappropriate rhythm,
    bad phrasing

30
Conclusion
  • Requirements differ
  • Most suitable as a dialogue partner
  • Surprising as speech database is read
  • Could be because utterances to synthesise are
    shorter and less complex than in the role of
    reading machine
  • Least suitable as a pronunciation model
  • Comprehensibility is not the only requirement
  • Accuracy and naturalness matter
  • Further requirements not highlighted by the
    literature

31
Investigation 2
  • Research questions
  • Do the different roles identified impose
    different requirements on the quality of speech
    synthesis?
  • Is TTS synthesis ready for use in CALL?

32
Design
  • Within subjects
  • N = 17, French teachers
  • Dependent variables
  • Quality of the speech output
  • Acceptability
  • Adequacy (appropriateness)
  • Readiness
  • Independent variables
  • Role of TTS
  • Reading machine
  • Pronunciation model
  • Segmental (or phonetic) level
  • Suprasegmental (or prosodic) level
  • Conversational partner
  • TTS synthesis system

33
Systems evaluated
  • http://www.research.att.com/ttsweb/tts/demo.php#top
  • http://212.8.184.250/tts/demo_login.jsp
  • http://www.multitel.be/TTS/layout.php?page=eLite_demo
  • http://www.acapela-group.com/text-to-speech-interactive-demo.html

34
Questionnaire
  • MOS-CALL
  • Based on
  • ITU-T Overall Quality Test
  • MOS-X (Polkosky and Lewis, 2003)

35
On-line presentation
36
System 1: AT&T Next-Gen (Alain)
Mean ratings of adequacy and acceptability
  • Analysis of the data using the Friedman test
    revealed that these differences were, however,
    not statistically significant (χ²r = 2.352, df =
    3, p = 0.503; χ²r = 6.616, df = 3, p = 0.085,
    respectively).
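
With four roles compared, the Friedman statistic has 3 degrees of freedom, and its p-value has a closed form that needs only the standard library. The sketch below is a consistency check on the two statistics reported above (2.352 and 6.616), not part of the original analysis. The function name is an assumption.

```python
import math

def chi2_p_df3(chi2):
    """Upper-tail p-value of a chi-square variable with 3 degrees of freedom."""
    half = chi2 / 2
    return math.erfc(math.sqrt(half)) + math.sqrt(2 * chi2 / math.pi) * math.exp(-half)

for label, reported in (("adequacy", 2.352), ("acceptability", 6.616)):
    p = chi2_p_df3(reported)
    print(label, round(p, 3), "significant" if p < 0.05 else "not significant")
```

Both values come out above the 0.05 threshold, in line with the conclusion on the slide.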

37
System 1: AT&T Next-Gen (Alain)
Mean ratings of quality of output
Significance of differences in ratings of quality
of output across roles
38
System 2: Nuance Vocalizer (Julie)
Mean ratings of adequacy and acceptability
  • Analysis of the data using the Friedman test
    revealed that these differences were significant
    for adequacy (χ²r = 8.010, df = 3, p = 0.046),
    but not for acceptability (χ²r = 6.303, df = 3,
    p = 0.098).

39
System 2: Nuance Vocalizer (Julie)
Mean ratings of quality of output
Significance of differences in ratings of quality
of output across roles
40
System 3: eLite (Vincent)
Mean ratings of adequacy and acceptability
  • As for System 1, the test revealed that the
    differences were not statistically significant
    for either adequacy or acceptability (χ²r =
    3.467, df = 3, p = 0.325; χ²r = 3.194, df = 3,
    p = 0.363, respectively).

41
System 3: eLite (Vincent)
Mean ratings of quality of output
Significance of differences in ratings of quality
of output across roles
42
System 4: BrightSpeech (Julie)
Mean ratings of adequacy and acceptability
  • Analysis of the data using the Friedman test
    revealed that the differences were statistically
    significant for adequacy (χ²r = 8.063, df = 3,
    p = 0.045), but not for acceptability (χ²r = 5.547,
    df = 3, p = 0.163).

43
System 4: BrightSpeech (Julie)
Mean ratings of quality of output
Significance of differences in ratings of quality
of output across roles
44
Discussion
  • It is not clear whether the different roles
    impose different requirements on the quality of
    TTS synthesis
  • The aspects of quality for which statistically
    significant differences were found differed
    across the TTS synthesis systems
  • Possible explanations for results
  • There are only small differences
  • Participants do not have enough context in order
    to rate the quality of the speech for use in the
    different roles
  • The similarity in requirements of the two types
    of pronunciation model

45
Is TTS synthesis ready for use in CALL?
Mean ratings of adequacy
  • Different TTS synthesis systems are most suitable
    for use in different roles
  • Reinforces the need to evaluate every TTS
    synthesis system
  • System 4 is ready for use in all applications
    where TTS synthesis adds value

Mean ratings of acceptability
46
System 1: AT&T Next-Gen (Alain)
Mean ratings of quality of output
47
System 2: Nuance Vocalizer (Julie)
Mean ratings of quality of output
48
System 3: eLite (Vincent)
Mean ratings of quality of output
49
Conclusions
  • CALL imposes requirements on the following
    aspects of the quality of the output of TTS
    synthesis systems
  • Comprehensibility, accuracy, naturalness,
    expressiveness
  • Further research is necessary to determine
    whether the different roles have different
    requirements
  • Some French TTS synthesis systems are reaching
    readiness for use in CALL in applications which
    add value
  • In order to fully meet the requirements of CALL
    more attention needs to be paid to accuracy and
    naturalness, in particular at the prosodic level,
    and expressiveness
  • This may not be the case for all languages:
    different languages pose different problems for TTS

50
Recent developments
  • Hybrid systems
  • FlexVoice (Venkatagiri, 2003).
  • Parametric: good at synthesising vowels
  • Concatenative: good at synthesising consonants
  • Emotional TTS
  • http://www.loquendo.com/en/technology/emotional_tts.htm
  • Flexibility
  • Blizzard Challenge: to develop a voice in a month
  • http://www.synsig.org/index.php/Blizzard_Challenge_2007
  • CapturaTalk
  • Take a picture, hear the words

http://www.capturatalk.com/
51
Publications and References
  • Publications
  • The research presented in this seminar is
    presented in more detail in
  • Handley, Z. and Hamel, M.-J. (2005). Establishing
    a Methodology for Benchmarking Speech Synthesis
    for Computer-Assisted Language Learning (CALL).
    Language Learning & Technology Journal. 9 (3):
    99-119.
    http://llt.msu.edu/vol9num3/handley/default.html
  • Handley, Z. and Hamel, M.-J. (2004).
    Investigating the Requirements of Speech
    Synthesis for CALL with a View to Developing a
    Benchmark. In Procs. InSTIL/ICALL 2004 (pp.
    71-74). Venice, Italy.
    http://sisley.cgm.unive.it/ICALL2004/papers/018Handley.pdf
  • References
  • Chapelle (2001) Computer Applications in Second
    Language Acquisition. Cambridge: Cambridge
    University Press.
  • Cohen (1993) The use of a voice synthesizer in
    the discovery of the written language by young
    children. Computers in Education. 21 (1/2): 25-30
  • Dutoit (1997) An Introduction to Text-to-Speech
    Synthesis. London: Kluwer Academic Publishers.
  • Ehsani and Knodt (1998) Speech technology in
    computer-aided language learning: Strengths and
    limitations of a new CALL paradigm. Language
    Learning & Technology. 2 (1): 45-60
  • Hamel (1998) Les outils de TALN dans SAFRAN.
    ReCALL Journal. 10 (1): 79-85
  • Hamel (2000). FreeText - An advanced hypermedia
    CALL system featuring NLP tools for a smart
    treatment of authentic documents and free
    production exercises. Canadian Association of
    Applied Linguistics 2000, Edmonton (Canada), May
    2000.
  • Hincks (2002) Speech Synthesis for Teaching
    Lexical Stress. TMH-QPSR. 44: 135-165
  • ISO (1999) Information Technology - Software
    Product Evaluation - Part 1: General Overview.
    ISO
  • Keller and Zellner-Keller (2000) Speech synthesis
    in language learning: challenges and
    opportunities. In Procs. InSTIL 2000 (pp.
    109-116). Dundee, Scotland: University of Abertay
    Dundee.
  • Paroubek and Blasband (1999) ELSE Executive
    Summary (short version).
    http://www.limsi.fr/TLP/ELSE/PreambleXwhyXwhatXrev3.htm
  • Polkosky and Lewis (2003) Expanding the MOS:
    Development and psychometric evaluation of the
    MOS-R and MOS-X. International Journal of Speech
    Technology. 6: 161-182
  • Santiago-Oriola (1999) Vocal Synthesis in a
    Computerized Dictation Exercise. In EUROSPEECH'99
    (pp. 191-194). Budapest, Hungary.
  • Stratil et al. (1987a) Exploration of Foreign
    Language Speech Synthesis. Literary and
    Linguistic Computing. 2 (2): 116-119