Title: Goals and Objectives


1
Phonetic Dissection of Switchboard-Corpus Automatic Speech Recognition Systems
Steven Greenberg and Shuangyu Chang
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
steveng, shawnc_at_icsi.berkeley.edu
http://www.icsi.berkeley.edu/steveng
Large Vocabulary Continuous Speech Recognition Workshop
Maritime Institute of Technology, Linthicum Heights, MD, May 4, 2001
2
Take Home Messages
  • PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY
    FACTOR UNDERLYING THE ABILITY TO CORRECTLY
    RECOGNIZE WORDS
  • Many different analyses (to follow) support this
    conclusion
  • Consonants appear to be more important than
    vowels
  • SYLLABLE STRUCTURE IS ALSO AN IMPORTANT FACTOR
    FOR ACCURATE RECOGNITION
  • The pattern of errors differs across the syllable
    (onset, nucleus, coda) and exhibits consistent
    patterns difficult to discern with other units of
    analysis
  • STRESS-ACCENT MAY PLAY AN IMPORTANT ROLE,
    PARTICULARLY FOR UNDERSTANDING THE NATURE OF
    WORD-DELETION ERRORS
  • Relation among stress-accent, syllable structure,
    vocalic identity and length
  • THE NATURE OF PRONUNCIATION MODELS and THEIR
    RELATION TO LEXICAL REPRESENTATIONS IS A
    POTENTIALLY KEY FACTOR
  • The unit of lexical representation (phones,
    articulatory features, etc.) is probably of the
    utmost importance for optimizing ASR performance
  • FUTURE PROGRESS IN ASR SYSTEM DEVELOPMENT IS
    LIKELY TO DEPEND ON DEEP INSIGHT INTO THE
    NATURE OF SPOKEN LANGUAGE

3
Structure of the Presentation
  • DESCRIPTION OF THE CORPUS MATERIALS FOR THE 2000
    AND 2001 EVALUATIONS
  • 2000: Brief (2-17 s) utterances spoken by
    hundreds of different speakers. No relation to
    the competitive evaluation
  • 2001: A subset of the competitive evaluation
  • BRIEF OVERVIEW OF THE ANALYSIS REGIME COMMON TO
    THE 2000 AND 2001 PHONETIC EVALUATIONS
  • File formats, time-mediated alignment,
    statistical analysis of the corpora, etc.
  • Details are contained in "Linguistic Dissection
    .." (in workshop notebook) and in "An
    Introduction ." (NIST Speech Transcription
    Workshop, 2000)
  • ANALYSES AND PATTERNS COMMON TO BOTH 2000 and
    2001 EVALUATIONS
  • Syllable structure, phonetic segments,
    articulatory-acoustic features. Details
    pertaining to the 2000 evaluation are in the
    papers cited above
  • PHONETIC CONFUSION MATRICES FOR THE 2001
    EVALUATION
  • FUTURE ANALYSIS PLANNED FOR THIS SPRING WHEN
    REMAINING 2001 SUBMISSIONS ARRIVE
  • Relationship between phonetic classification,
    pronunciation and language models

4
Evaluation Material - 2000
  • SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
  • Switchboard contains informal telephone dialogues
  • 54 minutes of material that was previously
    phonetically transcribed (by highly trained
    phonetics students from UC-Berkeley)
  • All of this material was hand-segmented at either
    the phonetic-segment or syllabic level by the
    transcribers
  • The syllabic-segmented material was subsequently
    segmented at the phonetic-segment level by a
    special-purpose neural network trained on
    72 minutes of hand-segmented Switchboard
    material. This automatic segmentation was
    manually verified.
  • THE PHONETIC SYMBOL SET and STP TRANSCRIPTIONS
    USED IN THE CURRENT PROJECT ARE AVAILABLE ON
    THE PHONEVAL WEB SITE
  • http://www.icsi.berkeley.edu/real/phoneval
  • THE ORIGINAL FOUR HOURS OF TRANSCRIPTION MATERIAL
    ARE AVAILABLE AT
  • http://www.icsi.berkeley.edu/real/stp

5
Evaluation Material Details - 2000
  • 581 DIFFERENT SPEAKERS
  • AN EQUAL BALANCE OF MALE AND FEMALE SPEAKERS
  • BROAD DISTRIBUTION OF UTTERANCE DURATIONS
  • 2-4 sec - 40%, 4-8 sec - 50%, 8-17 sec - 10%
  • COVERAGE OF ALL (7) U.S. DIALECT REGIONS IN
    SWITCHBOARD
  • A WIDE RANGE OF DISCUSSION TOPICS
  • VARIABILITY IN DIFFICULTY (VERY EASY TO VERY
    HARD)

[Figure: number of utterances by subjective difficulty and by dialect region]
6
Evaluation Material - 2001
  • SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
  • Seventy-four minutes of material phonetically
    labeled by five highly trained phonetics
    students from UC-Berkeley plus S. Greenberg
  • The material was hand-segmented at the syllabic
    level by the transcribers
  • The syllabic-segmented material was subsequently
    segmented at the phonetic-segment level by a
    special-purpose neural network trained
    originally on 72 minutes of hand-segmented
    Switchboard material (similar to the process
    performed the previous year)
  • THE PHONETIC SYMBOL SET and STP TRANSCRIPTIONS
    USED ARE AVAILABLE ON THE PHONEVAL WEB SITE
  • http://www.icsi.berkeley.edu/real/phoneval

7
Evaluation Material Details - 2001
  • A SUBSET OF THE HUB-5 COMPETITIVE EVALUATION
    CORPUS
  • A representative selection from the evaluation
    set, including an even distribution of data from
    the three main recording conditions (cellular and
    2 land-line conditions)
  • 21 SEPARATE CONVERSATIONS (2 speakers per
    conversation)
  • 42 DIFFERENT SPEAKERS
  • A TOTAL OF 74 MINUTES OF SPOKEN LANGUAGE MATERIAL
  • (including FILLED PAUSES, JUNCTURES, etc.)
  • AVERAGE LENGTH OF SPEECH PER SPEAKER: 106
    seconds
  • RANGE OF LENGTH PER SPEAKER: 48 s (least) to
    226 s (most)
  • STANDARD DEVIATION: 38 s
  • APPROXIMATELY ONE-THIRD OF THE MATERIAL FROM CELL
    PHONES
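
As a quick consistency check on the figures above, the total duration divided by the number of speakers should reproduce the stated per-speaker average. A minimal sketch using only the numbers on this slide (the variable names are illustrative):

```python
# Figures taken from the slide: 21 conversations, 2 speakers each,
# 74 minutes of material in total.
num_conversations = 21
speakers_per_conversation = 2
total_minutes = 74

num_speakers = num_conversations * speakers_per_conversation

# Mean seconds of speech per speaker.
mean_seconds = total_minutes * 60 / num_speakers

print(num_speakers, round(mean_seconds, 1))
```

The result rounds to the 106 seconds quoted on the slide, so the speaker count and the average are mutually consistent.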

8
Evaluation Sites - 2000
  • EIGHT SITES PARTICIPATED IN THE EVALUATION
  • All eight provided material for the
    unconstrained-recognition phase
  • Six sites also provided sufficient
    forced-alignment-recognition material (i.e.,
    phone/word labels and segmentation given the word
    transcript for each utterance) for a detailed
    analysis
  • ATT (forced-alignment recognition incomplete,
    not analyzed)
  • Bolt, Beranek and Newman
  • Cambridge University
  • Dragon (forced-alignment recognition incomplete,
    not analyzed)
  • Johns Hopkins University
  • Mississippi State University
  • SRI International
  • University of Washington

9
Evaluation Sites - 2001
  • SEVEN SITES ARE PARTICIPATING IN THE EVALUATION
  • Unconstrained-recognition phase: 6 sites
  • Forced alignment: 7 sites
  • Phone classification confidence scores: 5 sites
  • Variable condition recognition: 2 sites
  • Phone strings to words: 1 site
  • ATT
  • Bolt, Beranek and Newman
  • IBM
  • Johns Hopkins University
  • Mississippi State University
  • Philips
  • SRI International

10
Evaluation Data Status - 2001
  • HOWEVER, NOT ALL OF THE MATERIAL REQUIRED TO
    PERFORM THE ANALYSES HAS MATERIALIZED
  • The tables below summarize the commitments and
    currently usable data (certain data arrived in
    not-quite-ready-for-prime-time form)

[Tables: Commitments; Current (usable data)]
11
Initial Recognition File - Example
  • Parameter Key
  • START - Begin time (in seconds) of phone
  • DUR - Duration (in sec) of phone
  • PHN - Hypothesized phone ID
  • WORD - Hypothesized Word ID
  • Format is for all 674 files in the evaluation set
  • (Example courtesy of MSU)
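
The parameter key above suggests a simple per-phone record. The sketch below shows one way such a file could be read, assuming whitespace-separated columns in the order START, DUR, PHN, WORD. The column layout, the `PhoneRecord` class and the sample lines are assumptions for illustration; the actual example file is not reproduced in this transcript.

```python
from dataclasses import dataclass


@dataclass
class PhoneRecord:
    start: float  # START: begin time of the phone, in seconds
    dur: float    # DUR: duration of the phone, in seconds
    phn: str      # PHN: hypothesized phone ID
    word: str     # WORD: hypothesized word ID

    @property
    def end(self) -> float:
        """End time of the phone, in seconds."""
        return self.start + self.dur


def parse_recognition_lines(lines):
    """Parse 'START DUR PHN WORD' lines (assumed whitespace-separated)."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        start, dur, phn, word = fields
        try:
            records.append(PhoneRecord(float(start), float(dur), phn, word))
        except ValueError:
            continue  # skip header lines such as "START DUR PHN WORD"
    return records


# Invented sample lines, for illustration only (not from any site's submission).
sample = [
    "START DUR PHN WORD",
    "0.00 0.08 dh the",
    "0.08 0.06 ax the",
    "",
    "0.14 0.11 k cat",
]
records = parse_recognition_lines(sample)
print(len(records), records[-1])
```

Keeping start and duration as floats makes it straightforward to derive end times for the time-mediated alignment described later.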

12
Phone Mapping Procedure
  • EACH SUBMISSION SITE USED A (QUASI) CUSTOM PHONE
    SET
  • Most of the phone sets are available on the
    PHONEVAL web site
  • THE SITES' PHONE SETS WERE MAPPED TO A COMMON
    REFERENCE PHONE SET
  • The reference phone set is based on the ICSI
    Switchboard transcription material (STP), but
    is adapted to match the less granular symbol
    sets used by the submission sites
  • The set of mapping conventions relating to the
    STP (and reference) sets are also available on
    the PHONEVAL web site
  • THE REFERENCE PHONE SET WAS ALSO MAPPED TO THE
    SUBMISSION SITE PHONE SETS
  • This reverse mapping was done in order to ensure
    that variants of a phone were given due credit
    in the scoring procedure
  • For example, em (syllabic nasal) is mapped to
    ix m, and the vowel ix maps in certain
    instances to both ih and ax, depending on the
    specifics of the phone set
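
The two mapping directions described above can be sketched as a pair of lookup tables. Only the em and ix cases come from the slide; the table names, the function names and the treatment of unlisted phones (mapped to themselves) are assumptions made for this example, not the actual PHONEVAL mapping files.

```python
# Hypothetical fragment of one site's mapping onto the reference phone set.
SITE_TO_REFERENCE = {
    "em": ["ix", "m"],  # syllabic nasal expands to a two-phone sequence
}

# Reverse direction: phones that may be credited for a given reference phone.
REFERENCE_VARIANTS = {
    "ix": {"ix", "ih", "ax"},
}


def map_to_reference(site_phones):
    """Rewrite a site's phone string in terms of the reference phone set."""
    mapped = []
    for phone in site_phones:
        # Phones without an entry are assumed to map to themselves.
        mapped.extend(SITE_TO_REFERENCE.get(phone, [phone]))
    return mapped


def is_credited(reference_phone, hypothesized_phone):
    """True if the hypothesized phone counts as a variant of the reference."""
    allowed = REFERENCE_VARIANTS.get(reference_phone, {reference_phone})
    return hypothesized_phone in allowed


print(map_to_reference(["b", "aa", "dx", "em"]))
print(is_credited("ix", "ih"), is_credited("k", "g"))
```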

13
Phone Scoring Procedures - 2001
  • TWO METHODS WERE USED FOR THE 2001 EVALUATION
  • The UNCOMPENSATED form is the same as last
    year's scoring method. Only common phone
    ambiguities (such as ix, ih, ah, ax, etc.)
    are allowed
  • The TRANSCRIPTION-COMPENSATED form allows for
    certain phones commonly confused among human
    transcribers to be scored as correct, even
    though they would otherwise be scored as wrong
  • The compensated form of transcription lowers the
    phone error by ca. 10-20%
  • TIME-MEDIATED SCORING WAS OF TWO VARIETIES
  • A STRICT form is identical to that used in
    last year's evaluation. There is a severe penalty
    for deviations from time boundaries for words
    and phones
  • A LENIENT form allows for a much looser fit
    between time markers associated with words and
    phones. A weighting of 0.15 (relative to the
    STRICT form) was used (by modifying the penalty
    algorithm in SC-Lite). The 0.15 weight reduced
    the number of phone errors by ca. 20% without
    a significant decline in false-positive responses
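
To illustrate how a relative weight can loosen time mediation, the sketch below scales a boundary-deviation penalty by a weight. This is not SC-Lite's actual penalty algorithm, which the slides do not describe. The cost function, the segment representation and the sample times are all assumptions; only the 0.15 weight comes from the slide.

```python
def time_deviation(ref, hyp):
    """Sum of absolute start and end boundary differences, in seconds."""
    (ref_start, ref_end), (hyp_start, hyp_end) = ref, hyp
    return abs(ref_start - hyp_start) + abs(ref_end - hyp_end)


def alignment_cost(ref, hyp, same_label, time_weight=1.0):
    """Hypothetical cost of aligning a reference segment with a hypothesis.

    A label mismatch costs 1.0; boundary deviation is scaled by time_weight.
    """
    label_cost = 0.0 if same_label else 1.0
    return label_cost + time_weight * time_deviation(ref, hyp)


STRICT_WEIGHT = 1.0
LENIENT_WEIGHT = 0.15  # relative to STRICT, as stated on the slide

# Invented (start, end) times in seconds, for illustration only.
ref_seg = (0.10, 0.20)
hyp_seg = (0.14, 0.26)

strict = alignment_cost(ref_seg, hyp_seg, same_label=True, time_weight=STRICT_WEIGHT)
lenient = alignment_cost(ref_seg, hyp_seg, same_label=True, time_weight=LENIENT_WEIGHT)
print(strict, lenient)
```

Under this toy cost, the same boundary mismatch is penalized far less in the lenient condition, so an aligner is more willing to pair segments whose labels agree but whose boundaries differ.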

14
Visualization of a 3-D Confusion Matrix
  • When the matrix is sparsely coded, as below, it
    is more efficient to view the pattern as if
    squashed against a brick wall (see below)

The diagonal is plotted in a linear plane
15
Interlabeler Agreement (74) - 3 Transcribers
  • Highest for consonants (especially the stops)
  • Lowest for vowels (particularly the lax
    monophthongs)

[Figure: proportion concordance by phonetic segment, for vowels and consonants. Numbers refer to the concordance diagonal in the confusion matrices]
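
The "concordance diagonal" is the share of each segment's tokens that fall on the diagonal of the confusion matrix, i.e. where the two labels agree. A minimal sketch, assuming the matrix is stored as nested dictionaries of counts (the data layout and the counts are invented for illustration):

```python
def concordance_diagonal(confusions):
    """Return {segment: proportion of its tokens lying on the diagonal}.

    confusions[ref][other] is the number of times `ref` was paired
    with the label `other`.
    """
    diagonal = {}
    for ref, row in confusions.items():
        total = sum(row.values())
        if total:
            diagonal[ref] = row.get(ref, 0) / total
    return diagonal


# Invented counts, for illustration only.
toy_matrix = {
    "t": {"t": 9, "d": 1},
    "ih": {"ih": 5, "ix": 3, "ax": 2},
}
diag = concordance_diagonal(toy_matrix)
print(diag)
```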
16
Interlabeler Disagreement Patterns - 2001
  • INTERLABELER DISAGREEMENT PATTERNS WERE DERIVED
    FROM THE 2000 EVALUATION MATERIAL
  • Several minutes of material transcribed in common
    by 3 transcribers were analyzed (2 from
    1996-1997 STP, 1 from 2001 STP)
  • THE FOLLOWING PATTERNS WERE OBSERVED IN THE
    INTERLABELER DISAGREEMENT ANALYSIS
  • Consonants
  • Stop and nasal consonants exhibit a small amount
    of disagreement
  • Fricatives exhibit slightly higher amounts of
    disagreement
  • Liquids show a moderate amount of disagreement
  • Vowels
  • Lax monophthongs exhibit a high amount of
    disagreement
  • Diphthongs show a relatively small amount of
    disagreement
  • Tense, low monophthongs show relatively little
    disagreement (except for ao, probably a
    dialect issue)
  • Overall Transcriber Agreement was 70%

17
Interlabeler Disagreement Patterns - 2001
  • FROM SUCH PATTERNS THE FOLLOWING FORMS OF
    TOLERANCES WERE ALLOWED IN TRANSCRIPTION
    COMPENSATED SCORING

Segment   UNcompensated   Compensated
d         d               d dx
k         k               k
s         s               s z
n         n               n nx ng en
r         r               r axr er
iy        iy              iy ix ih
ao        ao              ao aa ow
ax        ax              ax ah aa ix
ix        ix ih ax        ix ih iy ax
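
The tolerances on this slide can be expressed as sets of acceptable labels per reference segment. The sketch below encodes them as best they can be read from the flattened table, so the grouping is an assumption. It scores a list of already-aligned phone pairs. The aligned pairs are invented, and the function counts substitutions only (no insertions or deletions), which is a simplification of the actual scoring.

```python
# Tolerance sets as read from the slide's table (grouping is an assumption).
UNCOMPENSATED = {
    "ix": {"ix", "ih", "ax"},
}
COMPENSATED = {
    "d": {"d", "dx"},
    "s": {"s", "z"},
    "n": {"n", "nx", "ng", "en"},
    "r": {"r", "axr", "er"},
    "iy": {"iy", "ix", "ih"},
    "ao": {"ao", "aa", "ow"},
    "ax": {"ax", "ah", "aa", "ix"},
    "ix": {"ix", "ih", "iy", "ax"},
}


def is_correct(ref, hyp, tolerances):
    """A hypothesis is correct if it lies in the reference's tolerance set."""
    return hyp in tolerances.get(ref, {ref})


def substitution_error_rate(aligned_pairs, tolerances):
    """Fraction of aligned (ref, hyp) pairs scored as wrong."""
    errors = sum(not is_correct(ref, hyp, tolerances) for ref, hyp in aligned_pairs)
    return errors / len(aligned_pairs)


# Invented aligned pairs, for illustration only.
pairs = [("d", "dx"), ("ix", "ih"), ("s", "z"), ("k", "g")]
uncomp = substitution_error_rate(pairs, UNCOMPENSATED)
comp = substitution_error_rate(pairs, COMPENSATED)
print(uncomp, comp)
```

On these toy pairs the compensated tables forgive d/dx and s/z, while k/g remains an error under both schemes.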
18
Transcription Compensation Affects Phone Error
  • COMPENSATING FOR TRANSCRIPTION CONFUSION PATTERNS
    LOWERS THE PHONE ERROR APPRECIABLY FOR MOST
    SITES

[Figure: phone error rate by site; STRICT time mediation]
19
Transcription Compensation Affects Phone Error
  • COMPENSATING FOR TRANSCRIPTION CONFUSION PATTERNS
    LOWERS THE PHONE ERROR APPRECIABLY FOR MOST
    SITES

[Figure: phone error rate by site; LENIENT time mediation]
20
Generation of Evaluation Data - 1
21
CTM File Format for Word Scoring
  • EACH SITE'S MATERIAL WAS PROCESSED THROUGH
    SC-LITE TO OBTAIN A WORD-ERROR SCORE AND ERROR
    ANALYSIS (IN TERMS OF ERROR TYPE)

ERROR KEY: C = CORRECT, I = INSERTION, N = NULL
ERROR, S = SUBSTITUTION
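
Given one error-key label per scored word, a word-error score can be tallied as sketched below. The sketch assumes that N (null error) marks a reference word with no hypothesized counterpart, i.e. a deletion, and that insertions do not add to the reference word count. The slide does not spell out either point, and the label sequence is invented.

```python
from collections import Counter


def word_error_rate(labels):
    """Return (error rate, counts) from a sequence of C/I/N/S labels.

    Assumes N marks a deletion; reference words = C + S + N.
    """
    counts = Counter(labels)
    reference_words = counts["C"] + counts["S"] + counts["N"]
    errors = counts["S"] + counts["N"] + counts["I"]
    return errors / reference_words, counts


# Invented label sequence, for illustration only.
rate, counts = word_error_rate(list("CCSCNCICCC"))
print(round(rate, 3), dict(counts))
```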
22
Generation of Evaluation Data - 2
23
Summary of Corpus Acoustic Properties
  • LEXICAL PROPERTIES
  • Lexical Identity
  • Unigram Frequency
  • Number of Syllables in Word
  • Number of Phones in Word
  • Word Duration
  • Speaking Rate
  • Prosodic Prominence
  • Energy Level
  • Lexical Compounds
  • Non-Words
  • Word Position in Utterance
  • SYLLABLE PROPERTIES
  • Syllable Structure
  • Syllable Duration
  • Syllable Energy
  • Prosodic Prominence
  • Prosodic Context
  • PHONE PROPERTIES
  • Phonetic Identity
  • Phone Frequency
  • Position within the Word
  • Position within the Syllable
  • Phone Duration
  • Speaking Rate
  • Phonetic Context
  • Contiguous Phones Correct
  • Contiguous Phones Wrong
  • Phone Segmentation
  • Articulatory Features
  • Articulatory Feature Distance
  • Phone Confusion Matrices
  • OTHER PROPERTIES
  • Speaker (Dialect, Gender)
  • Utterance Difficulty
  • Utterance Energy
  • Utterance Duration

24
Word- and Phone-Centric Big Lists
  • THE BIG LISTS CONTAIN SUMMARY INFORMATION ON
    55-65 SEPARATE PARAMETERS ASSOCIATED WITH
    PHONES, SYLLABLES, WORDS, UTTERANCES AND
    SPEAKERS, SYNCHRONIZED TO EITHER THE WORD (THIS
    SLIDE) OR THE PHONE

25
Generation of Evaluation Data - 3
26
Phoneval-2000 Web Site
  • FORCED ALIGNMENT FILES
  • Forced Alignment Files: BBN, JHU, MSU, WASH
  • Word-Level Alignment Errors: BBN, CU, JHU, MSU,
    SRI, WASH
  • Phone Error (Forced Alignment): CU, BBN, JHU, MSU,
    SRI, WASH
  • Alignment Word-Phone Mapping: BBN, JHU, MSU, WASH
  • BIG LISTS
  • Word-Centric: BBN, CU, JHU, MSU, SRI, WASH
  • Phone-Centric: BBN, JHU, MSU, WASH
  • Phonetic Confusion Matrices: BBN, JHU, MSU, WASH
  • RECOGNITION FILES
  • Converted Submissions: ATT, BBN, JHU, MSU, SRI,
    WASH
  • Word Level Recognition Errors: ATT, CU, BBN, JHU,
    MSU, SRI, WASH
  • Phone Error (Free Recognition): ATT, BBN, JHU, MSU,
    WASH
  • Word Recognition Phone Mapping: ATT, BBN, JHU, MSU,
    WASH
  • BIG LISTS
  • Word-Centric: ATT, CU, BBN, JHU, MSU, SRI, WASH
  • Phone-Centric: ATT, BBN, JHU, MSU, WASH
  • Phonetic Confusion Matrices: ATT, BBN, JHU, MSU,
    WASH
  • Description of the STP Phone Set
  • STP Transcription Material
  • Phone-Word Reference
  • Syllable-Word Reference
  • Phone Mapping for Each Site: ATT, BBN, JHU, MSU,
    WASH
  • STP-to-Reference Map
  • STP Phone-to-Articulatory-Feature Map

http://www.icsi.berkeley.edu/real/phoneval
27
A Syllable-Centric Perspective
In this presentation we will drill down from
the lexical to the phonetic tiers by way of the
syllable, the phone and articulatory-acoustic
features
[Diagram tiers: Words, Stress-accent, Phonetic segment, Articulatory-Acoustic Features]
28
Coarse Word and Phone Recognition
  • THE FOLLOWING SLIDES PROVIDE DETAILS ABOUT THE
    COARSE WORD AND PHONE SCORES FOR THE 2000 AND
    2001 EVALUATIONS
  • ALTHOUGH THE WORD AND PHONE SCORES ARE ROUGHLY
    COMPARABLE ACROSS YEARS (FOR ANALOGOUS
    CONDITIONS) THE 2001 EVALUATION HAS FOUR TIMES
    THE NUMBER OF SCORING CONDITIONS (FOR PHONES)
    BASED ON THE LENIENT vs. STRICT
    TIME-MEDIATION AND THE COMPENSATED vs.
    UNCOMPENSATED TRANSCRIPTION SCORING

29
Word Recognition Error (2000)
  • WORD ERROR RATES VARY BETWEEN 27% AND 43%
  • Substitutions are the major source of word errors

[Figure: word error rate by site and error type]
30
Prosodic Stress Word Error Rate (2000)
  • The effect of stress is most concentrated among
    word-deletion errors

Data represent averages across all eight ASR
systems
31
Syllable Structure Word Error Rate (2000)
  • Vowel-initial forms show the greatest error
  • Polysyllabic forms exhibit the lowest error
  • C = Consonant
  • V = Vowel
  • Data are averaged across all eight sites

32
Syllable Structure Word Error Rate (2000)
  • VOWEL-INITIAL forms exhibit the HIGHEST error
  • POLYSYLLABLES have the LOWEST error rate

33
Word Recognition Error (2001)
  • WORD ERROR RATES VARY BETWEEN 33% AND 49%
  • Substitutions are the major source of word errors

[Figure: word error rate by site and error type; STRICT time mediation]
34
Word Recognition Error (2001)
  • WORD ERROR RATES VARY BETWEEN 31% AND 44%
  • Substitutions are the major source of word errors

[Figure: word error rate by site and error type; LENIENT time mediation]
35
Prosodic Stress Word Error Rate (2001)
  • NOT YET
  • PROSODIC LABELING OF THIS MATERIAL REQUIRED FIRST
  • ANALYSIS SCHEDULED FOR JUNE, 2001

36
Syllable Structure Word Error Rate (2001)
  • Vowel-initial forms show the greatest error
  • Polysyllabic forms exhibit the lowest error,
    except for CVCV forms (probably due to forms
    such as gonna, etc.)
  • Data are averaged across all five sites

37
Syllable Structure Word Error Rate (2001)
  • VOWEL-INITIAL forms exhibit the HIGHEST error
  • POLYSYLLABLES have the LOWEST error rate

38
Are Word and Phone Errors Related? (2000)
  • COMPARISON OF THE WORD AND PHONE ERROR RATES
    ACROSS SITES SUGGESTS THAT WORD ERROR IS
    HIGHLY DEPENDENT ON THE PHONE ERROR RATE
  • The correlation between the two parameters is 0.78

Pronunciation Models?
The differential error rate is probably related
to the use of either pronunciation or language
models (or both)
[Figure: word and phone error rate by submission site]
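
A correlation of this kind is typically a Pearson coefficient computed over per-site pairs of word and phone error rates. The slides do not say which estimator was used, so the sketch below is an assumption. The inputs are abstract placeholder values, not the per-site figures, which are not reproduced in this transcript.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Placeholder values only: a perfectly linear relation and a perfectly inverse one.
r_linear = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r_inverse = pearson_r([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
print(r_linear, r_inverse)
```

With only a handful of sites, a coefficient like this rests on very few points, so it is best read as suggestive rather than conclusive.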
39
Are Word and Phone Errors Related? (2001)
  • COMPARISON OF THE WORD AND PHONE ERROR RATES
    ACROSS SITES SUGGESTS THAT WORD ERROR IS
    HIGHLY DEPENDENT ON THE PHONE ERROR RATE

Pronunciation Model?
[Figure: word and phone error rate by site; STRICT time mediation, transcription uncompensated]
40
Are Word and Phone Errors Related? (2001)
  • COMPARISON OF THE WORD AND PHONE ERROR RATES
    ACROSS SITES SUGGESTS THAT WORD ERROR IS
    HIGHLY DEPENDENT ON THE PHONE ERROR RATE

Pronunciation Model?
[Figure: word and phone error rate by site; LENIENT time mediation, transcription uncompensated]
41
Phonetic - Pronunciation Mismatch
  • THERE ARE A FAR GREATER NUMBER OF PRONUNCIATIONS
    IN THE TRANSCRIPTION MATERIALS THAN IN THE ASR
    LEXICONS
  • GIVEN THAT MOST WORDS ARE CORRECTLY RECOGNIZED,
    THIS RESULT IMPLIES THAT PHONETIC
    CLASSIFICATION IN ASR SYSTEMS IS, BY NECESSITY,
    HIGHLY AGRANULAR
  • THUS, UNUSUAL PRONUNCIATIONS ARE UNLIKELY TO BE
    DECODED CORRECTLY
  • THE COARSE NATURE OF THE PRONUNCIATION MODELS
    ALSO MAKES IT DIFFICULT TO FINE-TUNE THE RELATION
    BETWEEN THE PHONETIC CLASSIFIER AND
    PRONUNCIATION MODEL COMPONENTS

42
Pronunciation Variation in ASR Lexicons
  • MOST WORDS IN THE ASR LEXICONS HAVE A SINGLE
    PRONUNCIATION
  • EXCEPTIONS ARE HIGHLY FREQUENT WORDS (SUCH AS
    "THE" AND "AND"), WHICH HAVE 2 OR 3 PRONUNCIATION
    VARIANTS. NO WORD HAS MORE THAN 5
    PRONUNCIATION VARIANTS (AT LEAST NOT IN THE
    PHONETIC OUTPUT PROVIDED TO ICSI FOR THE
    EVALUATION)

43
Pronunciation Variation in Switchboard (2001)
  • THERE ARE DOZENS OF DIFFERENT PRONUNCIATIONS FOR
    THE 100 MOST FREQUENT WORDS IN THE PHONETIC
    EVALUATION MATERIAL

[Table columns: WORD, INSTANCES, PRON]
44
Pronunciation Variation in Switchboard (2001)
  • THERE ARE DOZENS OF DIFFERENT PRONUNCIATIONS FOR
    THE 100 MOST FREQUENT WORDS IN THE PHONETIC
    EVALUATION MATERIAL

[Table columns: WORD, INSTANCES, PRON]
45
Phone Error and Word Length (2000)
  • For CORRECT words, only one phone (on average) is
    misclassified
  • Implication: short words are highly tolerant of
    phone errors
  • For INCORRECT words, phone errors increase
    linearly with word length
  • Data are averaged across all eight sites

46
Phone Error and Word Length (2001)
  • For CORRECT words, only one phone (on average) is
    misclassified
  • Implication: short words are highly tolerant of
    phone errors
  • For INCORRECT words, phone errors increase
    linearly with word length
  • Data are averaged across all five sites

47
Phone Error - Forced Alignment (2000)
  • PHONE ERROR RATES VARY BETWEEN 35% AND 49%
  • This, despite having the word transcript!!!

[Figure: phone error rate by site and error type]
ATT and Dragon did not provide a complete set of
forced alignments
48
Phone Error - Forced Alignment (2001)
  • PHONE ERROR RATES VARY BETWEEN 40% AND 50%
  • Same picture for 2001. Suggests a potential
    mismatch between lexical and phonetic
    representations

[Figure: phone error rate by site and error type; STRICT time mediation, transcription uncompensated]
49
Phone Error - Forced Alignment (2001)
  • PHONE ERROR RATES VARY BETWEEN 30% AND 44%
  • Still a poor match between phonetic transcripts
    and lexical representations

[Figure: phone error rate by site and error type; LENIENT time mediation, transcription uncompensated]
50
Phone Error - Forced Alignment (2001)
  • PHONE ERROR RATES VARY BETWEEN 32% AND 38%
  • Still a lack of concordance with a tolerant
    scoring method

[Figure: phone error rate by site and error type; STRICT time mediation, transcription compensated]
51
Phone Error - Forced Alignment (2001)
  • PHONE ERROR RATES VARY BETWEEN 23% AND 29%
  • With the most tolerant scoring there is still
    some lack of concordance

[Figure: phone error rate by site and error type; LENIENT time mediation, transcription compensated]
52
Visualization of a 3-D Confusion Matrix
  • When the matrix is sparsely coded, as below, it
    is more efficient to view the pattern as if
    squashed against a brick wall (see below)

The diagonal is plotted in a linear plane
53
Phonetic Confusion Matrix - CVC Syllables
  • Onset consonants tend to be highly concordant
    with transcription
  • Coda consonants are slightly less concordant,
    particularly some fricatives

[Figure: proportion concordance by phonetic segment, panels labeled CVC and CVC; forced alignment. Numbers refer to the concordance diagonal in the confusion matrices]
54
Phonetic Confusions - CCVC, CVCC Syllables
  • Certain fricatives are problematic in CVCC coda
    position
  • Redo this figure and others - no wrong words,
    compare CVC, CVC, etc.

[Figure: proportion concordance by phonetic segment, panels labeled CCVC and CVCC; forced alignment. Numbers refer to the concordance diagonal in the confusion matrices]
55
Phonetic Confusions - CV and CVC Nuclei
  • Diphthongs and tense, low monophthongs tend to be
    concordant
  • Lax monophthongs tend to be less concordant (cf.
    stress-accent paper)

[Figure: proportion concordance by phonetic segment, panels labeled CVC and CV; forced alignment. Numbers refer to the concordance diagonal in the confusion matrices]
56
Phone Error - Unconstrained Recognition (2000)
  • PHONE ERROR RATES VARY BETWEEN 39% AND 55%
  • Phone error is only slightly greater than for
    forced alignments

57
Phone Error - Unconstrained Recognition (2001)
  • PHONE ERROR RATES VARY BETWEEN 44% AND 55%
  • Results similar to 2000 evaluation

Condition most analogous to 2000 evaluation
[Figure: phone error rate by site and error type; STRICT time mediation, transcription uncompensated]
58
Phone Error - Unconstrained Recognition (2001)
  • PHONE ERROR RATES VARY BETWEEN 38% AND 48%
  • Relaxing time-mediation brings down the error
    slightly

[Figure: phone error rate by site and error type; LENIENT time mediation, transcription uncompensated]
59
Phone Error - Unconstrained Recognition (2001)
  • PHONE ERROR RATES VARY BETWEEN 25% AND 39%
  • Transcription compensation also brings down the
    error

[Figure: phone error rate by site and error type; STRICT time mediation, transcription compensated]
60
Phone Error - Unconstrained Recognition (2001)
  • PHONE ERROR RATES VARY BETWEEN 27% AND 38%
  • Phone errors decline somewhat more with lax
    scoring

[Figure: phone error rate by site and error type; LENIENT time mediation, transcription compensated]
61
Phonetic Confusion Matrix - CV Onsets
  • ARROWS pinpoint problem segments
  • AFFRICATES and FRICATIVES are problematic in CV
    onset position
  • d is also problematic

[Figure: concordance by phonetic segment; unconstrained recognition. Numbers refer to the concordance diagonal in the confusion matrices]
62
Phonetic Confusion Matrix - CVC Onsets
  • Fricatives and affricates are problematic in CVC
    onset position

[Figure: proportion concordance by phonetic segment, correct vs. wrong words; unconstrained recognition. Numbers refer to the concordance diagonal in the confusion matrices]
63
Phonetic Confusion Matrix - CCVC Onsets
  • Certain fricatives are particularly problematic
    in CCVC onset position

[Figure: proportion concordance by phonetic segment, correct vs. wrong words; unconstrained recognition. Numbers refer to the concordance diagonal in the confusion matrices]
64
Phonetic Confusion Matrix - CVC Codas
  • Fricatives are particularly problematic in CVC
    coda position
  • Certain Stops are also problematic in CVC coda
    position

[Figure: proportion concordance by phonetic segment, correct vs. wrong words; unconstrained recognition. Numbers refer to the concordance diagonal in the confusion matrices]
65
Phonetic Confusion Matrix - CVCC Codas
  • Certain fricatives are problematic in CVCC coda
    position
  • d is also problematic in CVCC coda position

[Figure: proportion concordance by phonetic segment, correct vs. wrong words; unconstrained recognition. Numbers refer to the concordance diagonal in the confusion matrices]
66
Phonetic Confusion Matrix - CVC Nuclei
  • Certain vowels are a problem in CVC nucleus
    position
  • Note that the level of concordance is much lower
    for vowels than for consonants (in onset or
    coda position), even for correct words

[Figure: proportion concordance by phonetic segment, correct vs. wrong words; unconstrained recognition. Numbers refer to the concordance diagonal in the confusion matrices]
67
Phonetic Confusion Matrix - CV Nuclei
  • Diphthongs and low, tense vowels are more
    concordant with the transcription than the lax
    monophthongs (cf. stress-accent paper)

[Figure: proportion concordance by phonetic segment, correct vs. wrong words; unconstrained recognition. Numbers refer to the concordance diagonal in the confusion matrices]
68
Consonantal Onsets and AF Errors (2000)
  • Syllable onsets are intolerant of AF errors in
    CORRECT words
  • Place and manner AF errors are particularly high
    in INCORRECT onsets
  • Data are averaged across all eight sites

69
Consonantal Onsets and AF Errors (2001)
  • Syllable onsets are intolerant of AF errors,
    particularly place, in CORRECT words
  • Place and manner AF errors are particularly high
    in INCORRECT onsets
  • Syllable structure does not have the same effect
    as in the 2000 analysis
  • Data are averaged across all five sites

70
Consonantal Codas and AF Errors (2000)
  • Syllable codas exhibit a slightly higher
    tolerance for error than onsets
  • There is a high degree of AF error for wrong words
  • Data are averaged across all eight sites

71
Consonantal Codas and AF Errors (2001)
  • Syllable codas exhibit a slightly higher
    tolerance for error than onsets
  • There is a high degree of AF error for wrong words
  • Data are averaged across all five sites

72
Vocalic Nuclei and AF Errors (2000)
  • Nuclei exhibit a much higher tolerance for error
    than onsets and codas
  • There are many more errors than among syllabic
    onsets and codas
  • Data are averaged across all eight sites

73
Vocalic Nuclei and AF Errors (2001)
  • Nuclei exhibit a much higher tolerance for error
    than onsets and codas, particularly for height and
    front/back
  • There are many more errors than among syllabic
    onsets and codas
  • Data are averaged across all five sites

74
Into the (Near) Future
  • WITH THE ARRIVAL OF THE REMAINING
    FORCED-ALIGNMENT AND UNCONSTRAINED RECOGNITION
    DATA
  • It will be possible to investigate the
    relative contribution of the phonetic
    classification, pronunciation and language
    models to recognition performance
  • In order to do this, it is necessary to obtain
    unconstrained recognition, forced alignment and
    phone-confidence material from each site (to the
    extent possible); the phone-confidence metric
    is problematic
  • CUSTOMIZED ANALYSES FOR INDIVIDUAL SITES
  • SRI has different versions of their system (with
    and w/o adaptation, etc.)
  • ATT will use phone strings from ICSI
    transcription material
  • Individual diagnostics for each site (are there
    significant differences for specific
    parameters?)
  • MOST OF THE DATA FOR THE 2001 EVALUATION WILL BE
    POSTED ON THE PHONEVAL WEB SITE SHORTLY
  • WEB-BASED ORACLE DATABASE APPLICATION IS NEAR
    COMPLETION
  • Will enable web searches of the Phoneval
    corpus and graphing of the results (this is
    the tricky part, given the ugly nature of Oracle
    Web DB)
  • A PAPER DESCRIBING THE FULL SET OF ANALYSES WILL
    BE AVAILABLE AT THE END OF JUNE (2001)

75
Summary and Conclusions
  • PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY
    FACTOR UNDERLYING THE ABILITY TO CORRECTLY
    RECOGNIZE WORDS
  • Many different analyses (to follow) support this
    conclusion
  • Consonants appear to be more important than
    vowels
  • SYLLABLE STRUCTURE IS ALSO AN IMPORTANT FACTOR
    FOR ACCURATE RECOGNITION
  • The pattern of errors differs across the syllable
    (onset, nucleus, coda) and exhibits consistent
    patterns difficult to discern with other units of
    analysis
  • STRESS-ACCENT MAY PLAY AN IMPORTANT ROLE,
    PARTICULARLY FOR UNDERSTANDING THE NATURE OF
    WORD-DELETION ERRORS
  • Relation among stress-accent, syllable structure,
    vocalic identity and length
  • THE NATURE OF PRONUNCIATION MODELS and THEIR
    RELATION TO LEXICAL REPRESENTATIONS IS A
    POTENTIALLY KEY FACTOR
  • The unit of lexical representation (phones,
    articulatory features, etc.) is probably of the
    utmost importance for optimizing ASR performance
  • FUTURE PROGRESS IN ASR SYSTEM DEVELOPMENT IS
    LIKELY TO DEPEND ON DEEP INSIGHT INTO THE
    NATURE OF SPOKEN LANGUAGE