1
Evaluating Text-to-Speech Synthesis for use in
Computer-Assisted Language Learning

Zöe Handley
LSRI, University of Nottingham
January 2008

2
Context

EU Project FreeText
An advanced hypermedia CALL system featuring NLP
tools for a smart treatment of authentic
documents and free production exercises (Hamel et
al., 2000)

3
Plan

TTS synthesis in CALL
Evaluation
Requirements analysis
Validation of requirements
Readiness of TTS synthesis for CALL
Conclusions

4
What is TTS synthesis?

Speech synthesis
Systems that allow the generation of novel
messages, either from scratch (i.e. entirely by
rule) or by re-combining shorter pre-stored
units (van Bezooijen and van Heuven, 1997: 709)
Text-to-Speech Synthesis
The automatic generation of speech from text

5
What is Text-to-Speech Synthesis?

http://www.acapela-group.com/text-to-speech-interactive-demo.html

6
TTS synthesis: Why now?

The challenge of TTS synthesis
The man (and he certainly was one!) just said,
"Maybe. I'll see. I can't promise."
(McAllister, 1989)
Dr. Jones lives at 11 School Dr. and works on the
corner of St. James St. (Dutoit, 1997)
Rough, through, bough, thought, dough, cough, and
hiccough.
(Divay and Vitale, 1997)
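
The "Dr." and "St." example above can be made concrete with a toy sketch. The rule used here (a title reading when the next word is capitalised and the previous word is not, a street reading otherwise) is an assumption made for illustration only, not the method described by Dutoit (1997). The names EXPANSIONS and expand_abbreviations are hypothetical.

```python
# Toy abbreviation expander (illustrative assumption, not a real TTS front end).
# Each abbreviation maps to (reading before a name, reading after a name).
EXPANSIONS = {
    "Dr.": ("Doctor", "Drive"),
    "St.": ("Saint", "Street"),
}


def expand_abbreviations(sentence):
    """Expand ambiguous abbreviations using the neighbouring words."""
    tokens = sentence.split()
    expanded = []
    for i, token in enumerate(tokens):
        if token not in EXPANSIONS:
            expanded.append(token)
            continue
        before_name, after_name = EXPANSIONS[token]
        next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
        prev_word = tokens[i - 1] if i > 0 else ""
        # Title reading: followed by a capitalised word, not preceded by one.
        if next_word[:1].isupper() and not prev_word[:1].isupper():
            expanded.append(before_name)
        else:
            expanded.append(after_name)
    return " ".join(expanded)


if __name__ == "__main__":
    print(expand_abbreviations(
        "Dr. Jones lives at 11 School Dr. and works on the corner of St. James St."
    ))
```

On the example sentence this gives "Doctor Jones lives at 11 School Drive and works on the corner of Saint James Street". The sentence-final full stop is lost because it is fused with the last abbreviation, which is itself part of the challenge.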

A simple but general diagram of a TTS system
(Dutoit, 1997)
7
TTS synthesis: Why now? (cont.)

Parametric synthesis
Simulation of the acoustic signal
Formant synthesis
Concatenative synthesis
Concatenation of pre-recorded segments of natural
human speech
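
As a rough illustration of concatenative synthesis, the sketch below maps a phoneme sequence to the diphone units (phoneme-to-phoneme transitions) that would be looked up in a recorded inventory. Diphones are only one possible choice of unit; the function name to_diphones, the "_" silence marker, and the phoneme labels are assumptions made for illustration.

```python
def to_diphones(phonemes, silence="_"):
    """Return the diphone unit names needed to utter a phoneme sequence.

    The utterance is padded with silence so that the first and last
    phonemes also get a transition unit.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


if __name__ == "__main__":
    # Hypothetical phoneme labels for the word "cat".
    print(to_diphones(["k", "ae", "t"]))
```

A real system would then fetch the recording for each unit and smooth the joins; that signal-processing step is outside this sketch.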

The spectrogram for the word phonetician
8
TTS synthesis: Why now?

Formant synthesis

Concatenative synthesis

http://www.speaknspell.co.uk/speaknspell.html
http://www.acapela-group.com/text-to-speech-interactive-demo.html
9
Why TTS synthesis in CALL?

There is a general need in language learning and
teaching for self-paced interactive learning
environments which provide controlled
interactive speaking practice outside the
classroom (Ehsani and Knodt, 1998: 45).

10
CALL applications: Reading machine

Talking dictionary
Talking text
Talking word processor
Talking conjugator
Dictation
Grapheme-phoneme exercises

Oxford Hachette 4 French Dictionary on CD-ROM
11
CALL applications: Pronunciation model

Practice of individual and combined phonemes
Auditory discrimination
Repetition
Practice of intonation and prosody (the music of
speech)
Auditory discrimination
Repetition

SAFexo (Hamel, 1998 2003)
12
CALL applications: Dialogue partner

In combination with automatic speech recognition
and speech understanding, the generative power of
TTS synthesis can be harnessed to provide learners
with interactive speaking practice, i.e. a
dialogue partner

Mr Smoketoomuch Monty Python sketch (KTH, 1999)
http://www.speech.kth.se/
13
Benefits of TTS synthesis

Easy creation and editing of speech samples
Simultaneous presentation of text and speech
Low storage requirements
Non-human and therefore perceived as
non-judgemental
Improves on possibilities other media provide,
but does not add value, i.e. bring about new
possibilities

14
Benefits of TTS synthesis

Generation of examples on demand (Sherwood, 1981)
and therefore the automatic generation of
feedback, conversational turns, and exercises
with speech models
Adds value to CALL, i.e. brings about new
possibilities such as provision of interactive
conversations

15
Why evaluation?

Few CALL applications integrating TTS synthesis
are available on the market
Since the failure of the language laboratory,
teachers have been sceptical about unevaluated
technologies
The most common role that TTS synthesis assumes
outside CALL is that of a reading machine

16
Evaluation of TTS synthesis for CALL

CALL evaluation framework (Chapelle, 2001)
Judgemental evaluation of the application
Judgemental evaluation of the planned activity
Evaluation of learners' performance
Product oriented
Process oriented

Speech and Language Technology (SALT) evaluation
framework (Paroubek and Blasband, 1999)
Basic research evaluation
Technology evaluation
Usage evaluation

17
Framework for the evaluation of TTS synthesis for
use in CALL

Level 1
Viability and potential benefits of the use of
TTS synthesis in CALL
Level 2
Adequacy of TTS synthesis for use in CALL
Level 3
Potential of the CALL program to provide ideal
conditions for SLA
Level 4
Potential of the planned activity to provide
ideal conditions for SLA
Level 5
Learners' performance in the planned activity
Level 6
Success of the funding program

18
Evaluations of TTS Synthesis for CALL

Technology evaluations
Stratil et al. (1987a)
Evaluated the quality of a Spanish TTS chip for
use for the presentation of grammar exercises in
a language laboratory.
Usage evaluations: outcome-oriented
Santiago-Oriola (1999)
Evaluated the use of a French TTS synthesiser for
the presentation of dictation exercises.
Hincks (2002)
Evaluated the use of a Swedish TTS synthesiser in
combination with a speech editor (re-synthesis)
for teaching the lexical stress of English to
Swedophones.
Usage evaluations: process-oriented
Cohen (1993)
Evaluated the use of a talking word processor to
support literacy activities, namely writing
stories, for young learners of French as a
second language.
Impact evaluations
Stratil et al. (1987b)
Evaluated user reactions to the use of a Spanish
TTS chip for the presentation of grammar
exercises in a language laboratory.

19
The evaluation process

ISO (1999) and EAGLES (1999) guidelines
Establish the evaluation requirements
Establish the purpose of the evaluation
Identify the types of products to be evaluated
Specify the quality model
Specify the evaluation
Select metrics
Establish rating levels for metrics
Establish criteria for assessment
Design the evaluation
Execute the evaluation

20
CALL requirements

When the language competence of the system
begins to outstrip that of some of the better
second language users, such systems become useful
adjunct tools (Keller and Zellner-Keller, 2000)

21
CALL requirements analysis

Ideal conditions for Second Language Acquisition
(SLA) (Chapelle, 2001)
Language learning potential
Goals of SLA
Communicative competence
Quality of the output
Primary requirement: Comprehensibility/intelligibility
Secondary requirements: Accuracy and naturalness
At both the level of individual speech sounds and
the prosodic level
Focus on form
Flexibility
Speech rate, pitch

22
Explorative investigation

Research questions
Do the different roles identified impose
different requirements on the quality of speech
synthesis?
Does comprehensibility account for acceptability
for use in CALL?

23
Design

Within subjects
N = 17, French teachers
Dependent variables
Comprehensibility
Acceptability
Appropriateness
Frequency and seriousness of errors

Independent variables
Role of TTS
Reading machine
Pronunciation model
Dialogue partner

24
Materials and apparatus

1 research TTS system
FIPSVox, University of Geneva
Diphone concatenation
20 utterances representative of each role
Reading machine,
Pronunciation model,
Dialogue partner
Questionnaires
Likert scales: Comprehension, acceptability,
appropriateness
Word pointing paradigm (Van Santen, 1993)

25
Scales used

Comprehension and acceptability
Appropriateness

26
Word pointing paradigm
27
Results

Friedman test used to test for differences among
roles
Overall appropriateness
Significant at p < 0.05 (χ² = 12.182, df = 2, p = 0.002,
two-tailed)
Overall acceptability
Significant at p < 0.05 (χ² = 9.5, df = 2, p = 0.009,
two-tailed)
Overall comprehensibility
Significant at p < 0.05 (χ² = 18.667, df = 2, p < 0.001,
two-tailed)
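
For readers unfamiliar with the Friedman test, the following is a minimal sketch of how the statistic is obtained from within-subject ratings. It omits the correction for ties, and the ratings in the usage example are invented for illustration; they are not the study's data. With three roles (df = 2) the chi-square tail probability has the closed form exp(-x/2), which is used here to relate the reported statistics to their p-values. All function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 upwards; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(ratings):
    """Friedman chi-square for rows = raters, columns = conditions (no tie correction)."""
    n = len(ratings)
    k = len(ratings[0])
    rank_sums = [0.0] * k
    for row in ratings:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


if __name__ == "__main__":
    # Invented ratings: 4 raters x 3 roles (reading, pronunciation, dialogue).
    toy = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
    q = friedman_statistic(toy)
    print(q, chi2_sf_df2(q))
    # Tail probabilities for the statistics reported on this slide.
    for reported in (12.182, 9.5, 18.667):
        print(reported, chi2_sf_df2(reported))
```

The tail probabilities computed this way for 12.182 and 9.5 round to the reported 0.002 and 0.009, and the value for 18.667 falls below 0.001.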

28
Results: Relationship between comprehensibility
and acceptability

Spearman's rho used to test for correlation
Reading: Positively and strongly related
(rho = 0.793, N = 12, p = 0.001, one-tailed)
Pronunciation: Positively related, but not
strongly (rho = 0.547, N = 12, p = 0.033, one-tailed)
Conversation: Positively related, but not
strongly (rho = 0.504, N = 12, p = 0.047, one-tailed)
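
Spearman's rho is the Pearson correlation computed on ranks. The sketch below shows that calculation in plain Python; the ratings in the usage example are invented for illustration and are not the study's data, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 upwards; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x = sum(rx) / n
    mean_y = sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented comprehensibility and acceptability ratings from five raters.
    comprehensibility = [1, 2, 3, 4, 5]
    acceptability = [1, 3, 2, 5, 4]
    print(spearman_rho(comprehensibility, acceptability))
```

The significance levels quoted on the slide additionally depend on the sample size (N = 12) and the one-tailed test, which this sketch does not compute.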

29
Results (cont.)

Types of errors highlighted
Accuracy: bad segments, bad words, bad phrasing,
inappropriate intonation, bad sentence stress,
inappropriate rhythm
Naturalness: exaggerated intonation
Register (formality): inappropriate
dropping/retention of schwa, inappropriate
omission/insertion of liaison
Expressiveness: lacked emotion
Most frequent errors
Inappropriate intonation, inappropriate rhythm,
bad phrasing

30
Conclusion

Requirements differ
Most suitable as a dialogue partner
Surprising, as the speech database consists of read speech
Could be because utterances to synthesise are
shorter and less complex than in the role of
reading machine
Least suitable as a pronunciation model
Comprehensibility is not the only requirement
Accuracy and naturalness matter
Further requirements not highlighted by the
literature

31
Investigation 2

Research questions
Do the different roles identified impose
different requirements on the quality of speech
synthesis?
Is TTS synthesis ready for use in CALL?

32
Design

Within subjects
N = 17, French teachers
Dependent variables
Quality of the speech output
Acceptability
Adequacy (appropriateness)
Readiness

Independent variables
Role of TTS
Reading machine
Pronunciation model
Segmental (or phonetic) level
Suprasegmental (or prosodic) level
Conversational partner
TTS synthesis system

33
Systems evaluated

http://www.research.att.com/ttsweb/tts/demo.php#top
http://212.8.184.250/tts/demo_login.jsp
http://www.multitel.be/TTS/layout.php?page=eLite_demo
http://www.acapela-group.com/text-to-speech-interactive-demo.html

34
Questionnaire

MOS-CALL
Based on
ITU-T Overall Quality Test
MOS-X (Polkosky and Lewis, 2003)

35
On-line presentation
36
System 1: AT&T Next-Gen (Alain)
Mean ratings of adequacy and acceptability

Analysis of the data using the Friedman test
revealed that these differences were, however,
not statistically significant (χ²r = 2.352, df =
3, p = 0.503; χ²r = 6.616, df = 3, p = 0.085,
respectively).
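
With four role conditions the Friedman statistic is referred to a chi-square distribution with 3 degrees of freedom, whose upper-tail probability can be written with the complementary error function. The sketch below uses that closed form to relate the statistics reported for System 1 to their p-values; the function name chi2_sf_df3 is hypothetical.

```python
import math


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Closed form: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


if __name__ == "__main__":
    # Statistics reported for System 1 (adequacy, acceptability).
    for reported in (2.352, 6.616):
        print(reported, round(chi2_sf_df3(reported), 3))
```

Both values come out close to the reported p-values of 0.503 and 0.085, i.e. above the 0.05 threshold.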

37
System 1: AT&T Next-Gen (Alain)
Mean ratings of quality of output
Significance of differences in ratings of quality
of output across roles
38
System 2: Nuance Vocalizer (Julie)
Mean ratings of adequacy and acceptability

Analysis of the data using the Friedman test
revealed that these differences were significant
for adequacy (χ²r = 8.010, df = 3, p = 0.046),
but not for acceptability (χ²r = 6.303, df = 3,
p = 0.098).

39
System 2: Nuance Vocalizer (Julie)
Mean ratings of quality of output
Significance of differences in ratings of quality
of output across roles
40
System 3: eLite (Vincent)
Mean ratings of adequacy and acceptability

As for System 1, the Friedman test revealed that
the differences were not statistically significant
for either adequacy or acceptability (χ²r =
3.467, df = 3, p = 0.325; χ²r = 3.194, df = 3,
p = 0.363, respectively).

41
System 3: eLite (Vincent)
Mean ratings of quality of output
Significance of differences in ratings of quality
of output across roles
42
System 4: BrightSpeech (Julie)
Mean ratings of adequacy and acceptability

Analysis of the data using the Friedman test
revealed that the differences were statistically
significant for adequacy (χ²r = 8.063, df = 3,
p = 0.045), but not for acceptability (χ²r = 5.547,
df = 3, p = 0.163).

43
System 4: BrightSpeech (Julie)
Mean ratings of quality of output
Significance of differences in ratings of quality
of output across roles
44
Discussion

It is not clear whether the different roles
impose different requirements on the quality of
TTS synthesis
The aspects of quality for which statistically
significant differences were found differed
across the TTS synthesis systems
Possible explanations for results
There are only small differences
Participants do not have enough context in order
to rate the quality of the speech for use in the
different roles
The similarity in requirements of the two types
of pronunciation model

45
Is TTS synthesis ready for use in CALL?
Mean ratings of adequacy

Different TTS synthesis systems are most suitable
for use in different roles
Reinforces the need to evaluate every TTS
synthesis system
System 4 is ready for use in all applications
where TTS synthesis adds value

Mean ratings of acceptability
46
System 1: AT&T Next-Gen (Alain)
Mean ratings of quality of output
47
System 2: Nuance Vocalizer (Julie)
Mean ratings of quality of output
48
System 3: eLite (Vincent)
Mean ratings of quality of output
49
Conclusions

CALL imposes requirements on the following
aspects of the quality of the output of TTS
synthesis systems:
Comprehensibility, accuracy, naturalness,
expressiveness
Further research is necessary to determine
whether the different roles have different
requirements
Some French TTS synthesis systems are reaching
readiness for use in CALL in applications which
add value
In order to fully meet the requirements of CALL
more attention needs to be paid to accuracy and
naturalness, in particular at the prosodic level,
and expressiveness
This may not be the case for all languages:
different languages pose different problems for TTS

50
Recent developments

Hybrid systems
FlexVoice (Venkatagiri, 2003).
Parametric: good at synthesising vowels
Concatenative: good at synthesising consonants
Emotional TTS
http://www.loquendo.com/en/technology/emotional_tts.htm
Flexibility
Blizzard Challenge: to develop a voice in a month
http://www.synsig.org/index.php/Blizzard_Challenge_2007

CapturaTalk
Take a picture, hear the words

http://www.capturatalk.com/
51
Publications and References

Publications
The research presented in this seminar is
described in more detail in:
Handley, Z. and Hamel, M.-J. (2005). Establishing
a Methodology for Benchmarking Speech Synthesis
for Computer-Assisted Language Learning (CALL).
Language Learning & Technology, 9 (3), 99-119.
http://llt.msu.edu/vol9num3/handley/default.html
Handley, Z. and Hamel, M.-J. (2004).
Investigating the Requirements of Speech
Synthesis for CALL with a View to Developing a
Benchmark. In Procs. InSTIL/ICALL 2004 (pp.
71-74). Venice, Italy.
http://sisley.cgm.unive.it/ICALL2004/papers/018Handley.pdf
References
Chapelle (2001) Computer Applications in Second
Language Acquisition. Cambridge: Cambridge
University Press.
Cohen (1993) The use of a voice synthesizer in
the discovery of the written language by young
children. Computers in Education, 21 (1/2), 25-30.
Dutoit (1997) An Introduction to Text-to-Speech
Synthesis. London: Kluwer Academic Publishers.
Ehsani and Knodt (1998) Speech technology in
computer-aided language learning: Strengths and
limitations of a new CALL paradigm. Language
Learning & Technology, 2 (1), 45-60.
Hamel (1998) Les outils de TALN dans SAFRAN.
ReCALL Journal, 10 (1), 79-85.
Hamel (2000) FreeText - An advanced hypermedia
CALL system featuring NLP tools for a smart
treatment of authentic documents and free
production exercises. Canadian Association of
Applied Linguistics 2000, Edmonton (Canada), May
2000.
Hincks (2002) Speech Synthesis for Teaching
Lexical Stress. TMH-QPSR, 44, 135-165.
ISO (1999) Information Technology - Software
Product Evaluation - Part 1: General Overview.
ISO.
Keller and Zellner-Keller (2000) Speech synthesis
in language learning: challenges and
opportunities. In Procs. InSTIL 2000 (pp.
109-116). Dundee, Scotland: University of Abertay
Dundee.
Paroubek and Blasband (1999) ELSE Executive
Summary (short version).
http://www.limsi.fr/TLP/ELSE/PreambleXwhyXwhatXrev3.htm
Polkosky and Lewis (2003) Expanding the MOS:
Development and psychometric evaluation of the
MOS-R and MOS-X. International Journal of Speech
Technology, 6, 161-182.
Santiago-Oriola (1999) Vocal Synthesis in a
Computerized Dictation Exercise. In EUROSPEECH'99
(pp. 191-194). Budapest, Hungary.
Stratil et al. (1987a) Exploration of Foreign
Language Speech Synthesis. Literary and
Linguistic Computing, 2 (2), 116-119.