Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005 (PowerPoint transcript)
1
Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005
  • Christina Bennett
  • Language Technologies Institute
  • Carnegie Mellon University
  • Student Research Seminar
  • September 23, 2005

2
What is corpus-based speech synthesis?
3
Need for Speech Synthesis Evaluation
Motivation
  • Determine effectiveness of our improvements
  • Closer comparison of various corpus-based
    techniques
  • Learn about users' preferences
  • Healthy competition promotes progress and brings
    attention to the field

4
Blizzard Challenge Goals
Motivation
  • Compare methods across systems
  • Remove effects of different data by requiring
    the same data to be used
  • Establish a standard for repeatable evaluations
    in the field
  • My goal: bring the need for improved speech
    synthesis evaluation to the forefront of the
    community (positioning CMU as a leader in this
    regard)

5
Blizzard Challenge Overview
Challenge
  • Released first voices and solicited participation
    in 2004
  • Additional voices and test sentences released
    Jan. 2005
  • 1-2 weeks allowed to build voices and synthesize
    sentences
  • 1000 samples from each system
  • (50 sentences x 5 tests x 4 voices)

6
Evaluation Methods
Challenge
  • Mean Opinion Score (MOS): rate each sample on a
    numerical scale
  • Modified Rhyme Test (MRT): intelligibility test
    with the tested word embedded in a carrier phrase
  • Semantically Unpredictable Sentences (SUS):
    intelligibility test preventing listeners from
    using semantic knowledge to predict words
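A MOS is simply the mean of the listener ratings collected per system; a minimal sketch, using invented ratings rather than challenge data:

```python
# MOS aggregation sketch: each listener rates each sample on a 1-5 scale,
# and a system's MOS is the mean rating over all its samples.
# The ratings below are invented for illustration, not Blizzard data.
from statistics import mean

ratings = {
    "system_D": [4, 3, 3, 4, 2],
    "system_F": [2, 2, 3, 1, 2],
}

mos = {system: mean(scores) for system, scores in ratings.items()}
print(mos)  # {'system_D': 3.2, 'system_F': 2.0}
```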

7
Challenge setup: Tests
Challenge
  • 5 tests from 5 genres
  • 3 MOS tests (1 to 5 scale)
  • News, prose, conversation
  • 2 "type what you hear" tests
  • MRT: "Now we will say ___ again"
  • SUS: det-adj-noun-verb-det-adj-noun
  • 50 sentences collected from each system, 20
    selected for use in testing
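The det-adj-noun-verb-det-adj-noun template above can be filled by sampling one word per slot; a sketch, with a hypothetical word list rather than the actual challenge vocabulary:

```python
# SUS generation sketch following the template det-adj-noun-verb-det-adj-noun:
# each slot is filled with a random word of the right part of speech, giving
# grammatical but semantically unpredictable sentences. Word lists are
# illustrative only.
import random

LEXICON = {
    "det": ["the", "a"],
    "adj": ["green", "loud", "distant"],
    "noun": ["table", "storm", "idea"],
    "verb": ["eats", "paints", "follows"],
}
TEMPLATE = ["det", "adj", "noun", "verb", "det", "adj", "noun"]

def make_sus():
    # One random word per slot; no semantic constraints by design.
    return " ".join(random.choice(LEXICON[pos]) for pos in TEMPLATE)

print(make_sus())  # e.g. "the loud idea paints a distant table"
```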

8
Challenge setup: Systems
Challenge
  • 6 systems (random ID A-F)
  • CMU
  • Delaware
  • Edinburgh (UK)
  • IBM
  • MIT
  • Nitech (Japan)
  • Plus 1: "Team Recording Booth" (ID X)
  • Natural examples from the 4 voice talents

9
Challenge setup: Voices
Challenge
  • CMU ARCTIC databases
  • American English 2 male, 2 female
  • 2 from initial release
  • bdl (m)
  • slt (f)
  • 2 new DBs released for quick build
  • rms (m)
  • clb (f)

10
Challenge setup: Listeners
Challenge
  • Three listener groups
  • S: speech synthesis experts (50)
  • 10 requested from each participating site
  • V: volunteers (60 completed, 97 registered)
  • Anyone online
  • U: native US English-speaking undergraduates
    (58 completed, 67 registered)
  • Solicited and paid for participation

  • Counts as of 4/14/05

11
Challenge setup: Interface
Challenge
  • Entirely online
  • http://www.speech.cs.cmu.edu/blizzard/register-R.html
  • http://www.speech.cs.cmu.edu/blizzard/login.html
  • Register/login with email address
  • Keeps track of progress through tests
  • Can stop and return to tests later
  • Feedback questionnaire at end of tests

12
Fortunately, Team X is the clear winner
Results
Listener type S         Listener type V         Listener type U
MOS        type-in      MOS        type-in      MOS        type-in
X - 4.76   X - 8.5      X - 4.41   X - 10.3     X - 4.58   X - 7.3
D - 3.19   D - 14.7     D - 3.02   D - 17.1     D - 3.06   D - 16.3
E - 3.11   B - 15.0     E - 2.83   A - 19.7     E - 2.83   A - 19.3
C - 2.91   A - 17.4     B - 2.66   B - 20.3     B - 2.67   B - 19.6
B - 2.88   E - 20.6     C - 2.48   E - 25.0     C - 2.42   E - 21.7
F - 2.15   C - 22.5     F - 2.07   C - 25.6     A - 2.00   C - 22.8
A - 2.07   F - 32.7     A - 1.98   F - 41.8     F - 1.98   F - 35.2
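The type-in columns are error percentages (slide 22 refers to them as WER; lower is better). A standard word-level WER uses Levenshtein distance; this is a sketch of the usual computation, not the challenge's actual scoring script:

```python
def wer(reference, hypothesis):
    """Word error rate in percent: word-level Levenshtein distance
    (substitutions + insertions + deletions) over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)

# One substitution out of six reference words ~ 16.7%
print(wer("now we will say ship again", "now we will say sip again"))
```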
13
Team D consistently outperforms others
Results
(results table identical to slide 12)
14
Speech experts are optimistically biased
Results
(results table identical to slide 12)
15
Speech experts are in fact better experts
Results
(results table identical to slide 12)
16
Voice results: Listener preference
Results
  • slt is most liked, followed by rms
  • Type S: slt - 43.48% of votes cast; rms - 36.96%
  • Type V: slt - 50% of votes cast; rms - 28.26%
  • Type U: slt - 47.27% of votes cast; rms - 34.55%
  • But, preference does not necessarily match test
    performance

17
Voice results: Test performance
Results
Female voices - slt
Listener type   all-sys MOS    natural MOS    all-sys type-in   natural type-in
S               rms - 3.233    bdl - 4.827    rms - 10.5        rms - 3.2
S               clb - 3.154    rms - 4.809    clb - 16.0        clb - 9.3
S               slt - 2.994    slt - 4.738    slt - 20.8        bdl - 9.4
S               bdl - 2.941    clb - 4.690    bdl - 22.7        slt - 11.3
V               clb - 2.946    rms - 4.568    rms - 14.0        rms - 3.8
V               rms - 2.894    clb - 4.404    clb - 17.1        bdl - 12.0
V               slt - 2.884    bdl - 4.382    slt - 25.2        slt - 12.0
V               bdl - 2.635    slt - 4.296    bdl - 29.3        clb - 13.1
U               clb - 2.987    slt - 4.611    clb - 11.9        slt - 5.9
U               slt - 2.930    clb - 4.587    slt - 17.5        clb - 5.9
U               rms - 2.873    rms - 4.584    rms - 17.6        rms - 8.8
U               bdl - 2.678    bdl - 4.551    bdl - 28.7        bdl - 9.1
18
Voice results: Test performance
Results
Female voices - clb
(table identical to slide 17)
19
Voice results: Test performance
Results
Male voices - rms
(table identical to slide 17)
20
Voice results: Test performance
Results
Male voices - bdl
(table identical to slide 17)
21
Voice results: Natural examples
Results
Listener type S          Listener type V          Listener type U
MOS          type-in     MOS          type-in     MOS          type-in
bdl - 4.827  rms - 3.2   rms - 4.568  rms - 3.8   slt - 4.611  slt - 5.9
rms - 4.809  clb - 9.3   clb - 4.404  bdl - 12.0  clb - 4.587  clb - 5.9
slt - 4.738  bdl - 9.4   bdl - 4.382  slt - 12.0  rms - 4.584  rms - 8.8
clb - 4.690  slt - 11.3  slt - 4.296  clb - 13.1  bdl - 4.551  bdl - 9.1
What makes natural rms different?
22
Voice results: By system
Results
  • Only system B was consistent across listener
    types (slt best MOS, rms best WER)
  • Most others showed group trends (with the
    exception of B above and F), i.e.:
  • S: rms always best WER, often best MOS
  • V: slt usually best MOS, clb usually best WER
  • U: clb usually best MOS and always best WER
  • → Again, people clearly don't prefer the voices
    they most easily understand

23
Lessons learned: Listeners
Lessons
  • Reasons to exclude listener data:
  • Incomplete test, failure to follow directions,
    inability to respond (type-in), unusable
    responses
  • Type-in tests very hard to process automatically
  • Homophones, misspellings/typos, dialectal
    differences, smart listeners
  • Group differences
  • V most variable, U most controlled, S least
    problematic but not representative

24
Lessons learned: Test design
Lessons
  • Feedback re: tests
  • MOS: give examples to calibrate the scale
    (ordering schema); use multiple scales
    (lay-people?)
  • Type-in: warn about SUS; hard to remember; SUS
    words too unusual/hard to spell
  • Uncontrollable user test setup
  • Pros/Cons to having natural examples in the mix
  • Analyzing user response (+), differences in
    delivery style (-), availability of voice talent
    (?)

25
Goals Revisited
Lessons
  • One methodology clearly outshone the rest
  • All systems used the same data, allowing for
    actual comparison of systems
  • A standard for repeatable evaluations in the
    field was established
  • My goal: brought attention to the need for better
    speech synthesis evaluation (while positioning
    CMU as the experts)

26
For the Future
Future
  • (Bi-)Annual Blizzard Challenge
  • Introduced at Interspeech 2005 special session
  • Improve design of tests for easier analysis
    post-evaluation
  • Encourage more sites to submit their systems!
  • More data resources (problematic for the
    commercial entities)
  • Expand types of systems accepted (and therefore
    test types)
  • e.g. voice conversion

27
Lessons learned: Listeners (I)
Lessons
  • Reasons for excluding listener data:
  • Incomplete test
  • Failure to follow directions
  • Inability to respond to type-in tests (e.g.
    non-native speakers)
  • Unusable responses (8)
  • No effort ("don't know")
  • Inappropriate response (wrong sentence)
  • Extremely suspicious scoring (e.g. natural
    examples scored very low relative to the rest)

28
Lessons learned: Listeners (II)
Lessons
  • Type-in tests very hard to process
    automatically!
  • Homophones (dug / Doug, cereal / serial)
  • Misspellings / typos (comacozee, kamakazi,
    kamakasy, …)
  • Dialectal differences (been → bean or Ben?)
  • Smart listeners!
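The cleanup implied above can be partly automated by normalizing responses before scoring (lowercasing, stripping punctuation, mapping known homophones); a sketch, with an illustrative homophone table rather than the one actually used:

```python
# Normalization sketch for type-in responses before WER scoring.
# The homophone map below is illustrative, not the actual challenge table.
import re

HOMOPHONES = {"doug": "dug", "serial": "cereal"}

def normalize(response):
    text = response.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)  # drop punctuation and digits
    words = [HOMOPHONES.get(w, w) for w in text.split()]
    return " ".join(words)

print(normalize("Now we will say Doug again."))  # "now we will say dug again"
```

Misspellings like "comacozee" still defeat a lookup table; per-word fuzzy matching (e.g. edit distance against the reference word) or manual review is needed for those.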

29
Lessons learned: Listeners (III)
Lessons
  • Group V most variable
  • Least motivation for the task (fewest
    completions)
  • Requires close analysis to determine seriousness
  • Non-native speakers difficult to filter out
  • Group U most controlled
  • Motivated by pay, not science
  • Group S least problematic, but maybe not
    representative

30
Lessons learned: Test design (I)
Lessons
  • Most feedback was very good, but:
  • MOS scale
  • Give examples to calibrate 1 to 5
  • Ordering schema will counterbalance effects of
    user-defined scale
  • Use different scales for various qualities such
    as naturalness, intelligibility, etc.
  • Lay-people don't understand these terms, and
    they are not easily explained
  • Intelligibility tested directly via type-in tests

31
Lessons learned: Test design (II)
Lessons
  • Type-in tests
  • Warn listeners about SUS!
  • Hard to remember nonsensical phrases (i.e.
    sentences too long)
  • Words too unusual and/or hard to spell
  • → Together these add an unwanted memory/spelling
    dimension to the task

32
Lessons learned: Test design (III)
Lessons
  • Uncontrollable user test setup
  • One particular audio player/browser combination
    forced browser to advance to new page for every
    sound file
  • Pros/Cons to having natural examples in the mix
  • Valuable for analyzing user responses
  • Difficulty in evaluation due to clear effects of
    different delivery styles
  • Availability of voice talent for future
    evaluations

33
References
  • A. Black and K. Tokuda, "The Blizzard Challenge
    2005: Evaluating corpus-based speech synthesis on
    common datasets," to appear in Interspeech 2005,
    Lisbon, Portugal, 2005.
  • C. Bennett, "Large Scale Evaluation of
    Corpus-based Synthesizers: Results and Lessons
    from the Blizzard Challenge 2005," to appear in
    Interspeech 2005, Lisbon, Portugal, 2005.
  • C. Bennett et al., journal paper; more info to
    come.