Title: Evaluating Text-to-Speech Synthesis for use in Computer-Assisted Language Learning
1. Evaluating Text-to-Speech Synthesis for use in Computer-Assisted Language Learning
- Zöe Handley
- LSRI, University of Nottingham
- January 2008
2. Context
- EU Project FreeText
- An advanced hypermedia CALL system featuring NLP
tools for a smart treatment of authentic
documents and free production exercises (Hamel et
al., 2000)
3. Plan
- TTS synthesis in CALL
- Evaluation
- Requirements analysis
- Validation of requirements
- Readiness of TTS synthesis for CALL
- Conclusions
4. What is TTS synthesis?
- Speech synthesis
- systems that allow the generation of novel messages, either from scratch (i.e. entirely by rule) or by re-combining shorter pre-stored units (van Bezooijen and van Heuven, 1997: 709)
- Text-to-Speech Synthesis
- The automatic generation of speech from text
5. What is Text-to-Speech Synthesis?
- http://www.acapela-group.com/text-to-speech-interactive-demo.html
6. TTS synthesis: Why now?
- The challenge of TTS synthesis
- The man (and he certainly was one!) just said, "Maybe. I'll see. I can't promise." (McAllister, 1989)
- Dr. Jones lives at 11 School Dr. and works on the corner of St. James St. (Dutoit, 1997)
- Rough, through, bough, thought, dough, cough, and hiccough. (Divay and Vitale, 1997)
A simple but general diagram of a TTS system
(Dutoit, 1997)
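The abbreviation example above (Dutoit, 1997) can be illustrated with a minimal sketch of one text-normalisation heuristic. The expansion table, the function name `expand_abbreviations` and the capitalisation rule are assumptions made for illustration only; real TTS front ends use much richer linguistic analysis.

```python
# Hypothetical expansion table:
# (reading before a capitalised word, reading otherwise).
ABBREVIATIONS = {
    "Dr.": ("Doctor", "Drive"),
    "St.": ("Saint", "Street"),
}


def expand_abbreviations(sentence: str) -> str:
    """Expand ambiguous abbreviations using the following word as context."""
    tokens = sentence.split()
    expanded = []
    for i, token in enumerate(tokens):
        if token in ABBREVIATIONS:
            before_name, otherwise = ABBREVIATIONS[token]
            next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
            # A following capitalised word suggests a title or a saint's name.
            if next_token[:1].isupper():
                expanded.append(before_name)
            else:
                expanded.append(otherwise)
        else:
            expanded.append(token)
    text = " ".join(expanded)
    # A sentence-final abbreviation shares its full stop with the sentence.
    if tokens and tokens[-1] in ABBREVIATIONS:
        text += "."
    return text


print(expand_abbreviations(
    "Dr. Jones lives at 11 School Dr. and works on the corner of St. James St."
))
```

The heuristic handles this one sentence but would fail on, for example, a street abbreviation followed by a proper noun, which is part of why text analysis is a hard step in TTS synthesis.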
7. TTS synthesis: Why now? (cont.)
- Parametric synthesis
- Simulation of the acoustic signal
- Formant synthesis
- Concatenative synthesis
- Concatenation of pre-recorded segments of natural
human speech
The spectrogram for the word phonetician
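As a rough illustration of concatenative synthesis, the sketch below splits a phoneme sequence into diphone units and joins pre-stored segments for each unit. The phoneme symbols, the unit-naming scheme and the tiny `inventory` of placeholder sample lists are all hypothetical; a real system stores recorded waveforms and smooths the joins between units.

```python
def to_diphones(phonemes, silence="_"):
    """Return the diphone unit names covering a phoneme sequence.

    Each unit spans the transition from one phoneme to the next,
    with silence padding at both ends of the utterance.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def concatenate(diphones, inventory):
    """Join the stored samples of each diphone, in order."""
    samples = []
    for unit in diphones:
        samples.extend(inventory[unit])  # KeyError if a unit was never recorded
    return samples


# Illustrative phoneme string (SAMPA-like symbols, assumed for this example).
units = to_diphones(["b", "o~", "Z", "u", "R"])
print(units)

# Placeholder inventory: each unit maps to a short list of fake samples.
inventory = {unit: [index, index] for index, unit in enumerate(units)}
print(concatenate(units, inventory))
```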
8. TTS synthesis: Why now?
http://www.speaknspell.co.uk/speaknspell.html
http://www.acapela-group.com/text-to-speech-interactive-demo.html
9. Why TTS synthesis in CALL?
- There is a general need in language learning and teaching for self-paced interactive learning environments which provide controlled interactive speaking practice outside the classroom (Ehsani and Knodt, 1998: 45).
10. CALL applications: Reading machine
- Talking dictionary
- Talking text
- Talking word processor
- Talking conjugator
- Dictation
- Grapheme-phoneme exercises
Oxford Hachette 4 French Dictionary on CD-ROM
11. CALL applications: Pronunciation model
- Practice of individual and combined phonemes
- Auditory discrimination
- Repetition
- Practice of intonation and prosody (the music of speech)
- Auditory discrimination
- Repetition
SAFexo (Hamel, 1998; 2003)
12. CALL applications: Dialogue partner
- In combination with automatic speech recognition and speech understanding, the generative power of TTS synthesis can be harnessed to provide learners with interactive speaking practice, i.e. a dialogue partner
Mr Smoketoomuch Monty Python sketch (KTH, 1999)
http://www.speech.kth.se/
13. Benefits of TTS synthesis
- Easy creation and editing of speech samples
- Simultaneous presentation of text and speech
- Low storage requirements
- Non-human and therefore perceived as non-judgemental
- Improves on possibilities other media provide, but does not add value, i.e. bring about new possibilities
14. Benefits of TTS synthesis
- Generation of examples on demand (Sherwood, 1981) and therefore the automatic generation of feedback, conversational turns, and exercises with speech models
- Adds value to CALL, i.e. brings about new possibilities such as provision of interactive conversations
15. Why evaluation?
- Few CALL applications integrating TTS synthesis are available on the market
- Since the failure of the language laboratory, teachers have been sceptical about unevaluated technologies
- The most common role that TTS synthesis assumes outside CALL is that of a reading machine
16. Evaluation of TTS synthesis for CALL
- CALL evaluation framework (Chapelle, 2001)
- Judgemental evaluation of the application
- Judgemental evaluation of the planned activity
- Evaluation of learners' performance
- Product oriented
- Process oriented
- Speech and Language Technology (SALT) evaluation framework (Paroubek and Blasband, 1999)
- Basic research evaluation
- Technology evaluation
- Usage evaluation
17. Framework for the evaluation of TTS synthesis for use in CALL
- Level 1
- Viability and potential benefits of the use of TTS synthesis in CALL
- Level 2
- Adequacy of TTS synthesis for use in CALL
- Level 3
- Potential of the CALL program to provide ideal conditions for SLA
- Level 4
- Potential of the planned activity to provide ideal conditions for SLA
- Level 5
- Learners' performance in the planned activity
- Level 6
- Success of the funding program
18. Evaluations of TTS Synthesis for CALL
- Technology evaluations
- Stratil et al. (1987a)
- Evaluated the quality of a Spanish TTS chip for the presentation of grammar exercises in a language laboratory.
- Usage evaluations: outcome-oriented
- Santiago-Oriola (1999)
- Evaluated the use of a French TTS synthesiser for the presentation of dictation exercises.
- Hincks (2002)
- Evaluated the use of a Swedish TTS synthesiser in combination with a speech editor (re-synthesis) for teaching the lexical stress of English to Swedophones.
- Usage evaluations: process-oriented
- Cohen (1993)
- Evaluated the use of a talking word processor to support literacy activities, namely writing stories, for young learners of French as a second language.
- Impact evaluations
- Stratil et al. (1987b)
- Evaluated user reactions to the use of a Spanish TTS chip for the presentation of grammar exercises in a language laboratory.
19. The evaluation process
- ISO (1999) and EAGLES (1999) guidelines
- Establish the evaluation requirements
- Establish the purpose of the evaluation
- Identify the types of products to be evaluated
- Specify the quality model
- Specify the evaluation
- Select metrics
- Establish rating levels for metrics
- Establish criteria for assessment
- Design the evaluation
- Execute the evaluation
20. CALL requirements
- When the language competence of the system
begins to outstrip that of some of the better
second language users, such systems become useful
adjunct tools (Keller and Zellner-Keller, 2000)
21. CALL requirements analysis
- Ideal conditions for Second Language Acquisition (SLA) (Chapelle, 2001)
- Language learning potential
- Goals of SLA
- Communicative competence
- Quality of the output
- Primary requirement: Comprehensibility/intelligibility
- Secondary requirements: Accuracy and naturalness
- At both the level of individual speech sounds and the prosodic level
- Focus on form
- Flexibility
- Speech rate, pitch
22. Explorative investigation
- Research questions
- Do the different roles identified impose different requirements on the quality of speech synthesis?
- Does comprehensibility account for acceptability for use in CALL?
23. Design
- Within subjects
- N = 17, French teachers
- Dependent variables
- Comprehensibility
- Acceptability
- Appropriateness
- Frequency and seriousness of errors
- Independent variables
- Role of TTS
- Reading machine
- Pronunciation model
- Dialogue partner
24. Materials and apparatus
- 1 research TTS system
- FIPSVox, University of Geneva
- Diphone concatenation
- 20 utterances representative of each role
- Reading machine
- Pronunciation model
- Dialogue partner
- Questionnaires
- Likert scales: comprehension, acceptability, appropriateness
- Word pointing paradigm (Van Santen, 1993)
25. Scales used
- Comprehension and acceptability
- Appropriateness
26. Word pointing paradigm
27. Results
- Friedman test used to test for difference among roles
- Overall Appropriateness
- Significant at p < 0.05 (χ² = 12.182, df = 2, p = 0.002, two-tailed)
- Overall Acceptability
- Significant at p < 0.05 (χ² = 9.5, df = 2, p = 0.009, two-tailed)
- Overall Comprehensibility
- Significant at p < 0.05 (χ² = 18.667, df = 2, p < 0.001, two-tailed)
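For readers unfamiliar with the Friedman test, the following is a minimal pure-Python sketch of how the statistic is computed from within-subject ratings. The toy ratings are hypothetical and are not the study's data; the function names are assumptions, and no correction for tied ranks is applied, so results may differ slightly from statistics packages. The p-value helper is exact only for df = 2 (three roles).

```python
import math


def average_ranks(values):
    """Rank values from 1 (lowest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(ratings):
    """ratings: one row per rater, one column per condition (role)."""
    n = len(ratings)
    k = len(ratings[0])
    rank_sums = [0.0] * k
    for row in ratings:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


# Hypothetical ratings: 4 raters x 3 roles (not data from the study).
toy_ratings = [[2, 3, 5], [1, 4, 5], [2, 3, 4], [3, 2, 5]]
statistic = friedman_statistic(toy_ratings)
print(statistic, chi2_sf_df2(statistic))

# The p-values reported on the slide follow from its chi-square values at df = 2.
print(round(chi2_sf_df2(12.182), 3), round(chi2_sf_df2(9.5), 3))
```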
28. Results: Relationship between comprehensibility and acceptability
- Spearman's rho used to test for correlation
- Reading: Positively and strongly related (rho = 0.793, N = 12, p = 0.001, one-tailed)
- Pronunciation: Positively related, but not strongly (rho = 0.547, N = 12, p = 0.033, one-tailed)
- Conversation: Positively related, but not strongly (rho = 0.504, N = 12, p = 0.047, one-tailed)
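The correlation coefficient used above can be sketched as the Pearson correlation of the ranked scores, which handles tied ratings. The function names and the toy scores are assumptions for illustration; the scores are hypothetical, not the study's data.

```python
import math


def average_ranks(values):
    """Rank values from 1 (lowest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y.

    Raises ZeroDivisionError if either variable is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return covariance / (spread_x * spread_y)


# Hypothetical scores for five raters (not data from the study).
comprehensibility = [1, 2, 3, 4, 5]
acceptability = [2, 1, 4, 3, 5]
print(spearman_rho(comprehensibility, acceptability))
```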
29. Results (cont.)
- Types of errors highlighted
- Accuracy: bad segments, bad words, bad phrasing, inappropriate intonation, bad sentence stress, inappropriate rhythm
- Naturalness: exaggerated intonation
- Register (formality): inappropriate dropping/retention of schwa, inappropriate omission/insertion of liaison
- Expressiveness: lacked emotion
- Most frequent errors
- Inappropriate intonation, inappropriate rhythm, bad phrasing
30. Conclusion
- Requirements differ
- Most suitable as a dialogue partner
- Surprising, as the speech database consists of read speech
- Could be because utterances to synthesise are shorter and less complex than in the role of reading machine
- Least suitable as a pronunciation model
- Comprehensibility is not the only requirement
- Accuracy and naturalness matter
- Further requirements not highlighted by the literature
31. Investigation 2
- Research questions
- Do the different roles identified impose different requirements on the quality of speech synthesis?
- Is TTS synthesis ready for use in CALL?
32. Design
- Within subjects
- N = 17, French teachers
- Dependent variables
- Quality of the speech output
- Acceptability
- Adequacy (appropriateness)
- Readiness
- Independent variables
- Role of TTS
- Reading machine
- Pronunciation model
- Segmental (or phonetic) level
- Suprasegmental (or prosodic) level
- Conversational partner
- TTS synthesis system
33. Systems evaluated
- http://www.research.att.com/~ttsweb/tts/demo.php#top
- http://212.8.184.250/tts/demo_login.jsp
- http://www.multitel.be/TTS/layout.php?page=eLite_demo
- http://www.acapela-group.com/text-to-speech-interactive-demo.html
34. Questionnaire
- MOS-CALL
- Based on
- ITU-T Overall Quality Test
- MOS-X (Polkosky and Lewis, 2003)
35. On-line presentation
36. System 1: AT&T Next-Gen (Alain)
Mean ratings of adequacy and acceptability
- Analysis of the data using the Friedman test revealed that these differences were, however, not statistically significant (χ²r = 2.352, df = 3, p = 0.503; χ²r = 6.616, df = 3, p = 0.085, respectively).
37. System 1: AT&T Next-Gen (Alain)
Mean ratings of quality of output
Significance of differences in ratings of quality
of output across roles
38. System 2: Nuance Vocalizer (Julie)
Mean ratings of adequacy and acceptability
- Analysis of the data using the Friedman test revealed that these differences were significant for adequacy (χ²r = 8.010, df = 3, p = 0.046), but not for acceptability (χ²r = 6.303, df = 3, p = 0.098).
39. System 2: Nuance Vocalizer (Julie)
Mean ratings of quality of output
Significance of differences in ratings of quality
of output across roles
40. System 3: eLite (Vincent)
Mean ratings of adequacy and acceptability
- As for System 1, the Friedman test revealed that the differences were not statistically significant for either adequacy or acceptability (χ²r = 3.467, df = 3, p = 0.325; χ²r = 3.194, df = 3, p = 0.363, respectively).
41. System 3: eLite (Vincent)
Mean ratings of quality of output
Significance of differences in ratings of quality
of output across roles
42. System 4: BrightSpeech (Julie)
Mean ratings of adequacy and acceptability
- Analysis of the data using the Friedman test revealed that the differences were statistically significant for adequacy (χ²r = 8.063, df = 3, p = 0.045), but not for acceptability (χ²r = 5.547, df = 3, p = 0.163).
43. System 4: BrightSpeech (Julie)
Mean ratings of quality of output
Significance of differences in ratings of quality
of output across roles
44. Discussion
- It is not clear whether the different roles impose different requirements on the quality of TTS synthesis
- The aspects of quality for which statistically significant differences were found differed across the TTS synthesis systems
- Possible explanations for results
- There are only small differences
- Participants do not have enough context in order to rate the quality of the speech for use in the different roles
- The similarity in requirements of the two types of pronunciation model
45. Is TTS synthesis ready for use in CALL?
Mean ratings of adequacy
- Different TTS synthesis systems are most suitable for use in different roles
- Reinforces the need to evaluate every TTS synthesis system
- System 4 is ready for use in all applications where TTS synthesis adds value
Mean ratings of acceptability
46. System 1: AT&T Next-Gen (Alain)
Mean ratings of quality of output
47. System 2: Nuance Vocalizer (Julie)
Mean ratings of quality of output
48. System 3: eLite (Vincent)
Mean ratings of quality of output
49. Conclusions
- CALL imposes requirements on the following aspects of the quality of the output of TTS synthesis systems
- Comprehensibility, accuracy, naturalness, expressiveness
- Further research is necessary to determine whether the different roles have different requirements
- Some French TTS synthesis systems are reaching readiness for use in CALL in applications which add value
- In order to fully meet the requirements of CALL, more attention needs to be paid to accuracy and naturalness, in particular at the prosodic level, and expressiveness
- This may not be the case for all languages: different languages pose different problems to TTS
50. Recent developments
- Hybrid systems
- FlexVoice (Venkatagiri, 2003).
- Parametric: good at synthesising vowels
- Concatenative: good at synthesising consonants
- Emotional TTS
- http://www.loquendo.com/en/technology/emotional_tts.htm
- Flexibility
- Blizzard Challenge to develop a voice in a month
- http://www.synsig.org/index.php/Blizzard_Challenge_2007
- CapturaTalk
- Take a picture, hear the words
http://www.capturatalk.com/
51. Publications and References
- Publications
- The research presented in this seminar is presented in more detail in:
- Handley, Z. and Hamel, M.-J. (2005). Establishing a Methodology for Benchmarking Speech Synthesis for Computer-Assisted Language Learning (CALL). Language Learning & Technology Journal, 9 (3), 99-119. http://llt.msu.edu/vol9num3/handley/default.html
- Handley, Z. and Hamel, M.-J. (2004). Investigating the Requirements of Speech Synthesis for CALL with a View to Developing a Benchmark. In Procs. InSTIL/ICALL 2004 (pp. 71-74). Venice, Italy. http://sisley.cgm.unive.it/ICALL2004/papers/018Handley.pdf
- References
- Chapelle (2001) Computer Applications in Second Language Acquisition. Cambridge: Cambridge University Press.
- Cohen (1993) The use of a voice synthesizer in the discovery of the written language by young children. Computers in Education, 21 (1/2), 25-30.
- Dutoit (1997) An Introduction to Text-to-Speech Synthesis. London: Kluwer Academic Publishers.
- Ehsani and Knodt (1998) Speech technology in computer-aided language learning: Strengths and limitations of a new CALL paradigm. Language Learning & Technology, 2 (1), 45-60.
- Hamel (1998) Les outils de TALN dans SAFRAN. ReCALL Journal, 10 (1), 79-85.
- Hamel (2000) FreeText - An advanced hypermedia CALL system featuring NLP tools for a smart treatment of authentic documents and free production exercises. Canadian Association of Applied Linguistics 2000, Edmonton (Canada), May 2000.
- Hincks (2002) Speech Synthesis for Teaching Lexical Stress. TMH-QPSR, 44, 135-165.
- ISO (1999) Information Technology - Software Product Evaluation - Part 1: General Overview. ISO.
- Keller and Zellner-Keller (2000) Speech synthesis in language learning: challenges and opportunities. In Procs. InSTIL 2000 (pp. 109-116). Dundee, Scotland: University of Abertay Dundee.
- Paroubek and Blasband (1999) ELSE Executive Summary (short version). http://www.limsi.fr/TLP/ELSE/PreambleXwhyXwhatXrev3.htm
- Polkosky and Lewis (2003) Expanding the MOS: Development and psychometric evaluation of the MOS-R and MOS-X. International Journal of Speech Technology, 6, 161-182.
- Santiago-Oriola (1999) Vocal Synthesis in a Computerized Dictation Exercise. In EUROSPEECH '99 (pp. 191-194). Budapest, Hungary.
- Stratil et al. (1987a) Exploration of Foreign Language Speech Synthesis. Literary and Linguistic Computing, 2 (2), 116-119.