Title: Phonetics: Speech production and perception
1Introduction
- Phonetics: Speech production and perception
- Phonology: Study of sound combinations
- Orthography: Writing systems
- We'll talk about each area and how it affects Natural Language Processing
2Phonetics
Study of speech production and perception
- Phone: any of the set of all sounds that humans can articulate
- Phoneme: a distinct family of phones in a language
- Languages utilize 15 to 40 phonemes
- Note: Too few distinct sounds for a language's vocabulary
- Ears are tuned to hear a language's distinct phonemes
- Languages are easy to speak and still be understood
- Infer the phoneme set: find words differing in only one sound (minimal pairs)
- Allophone: variant realizations of a phoneme
- Can be separate phonemes in another language
- Segment: all phones, phonemes, and allophones
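The "find words differing in only one sound" procedure can be sketched in code. This is a minimal illustration under stated assumptions: the tiny lexicon and its ARPABET-style transcriptions are hypothetical, and the helper names are invented for this example.

```python
from itertools import combinations

# Hypothetical toy lexicon: word -> phone sequence (ARPABET-style labels).
LEXICON = {
    "pet": ["p", "eh", "t"],
    "bet": ["b", "eh", "t"],
    "bit": ["b", "ih", "t"],
    "beat": ["b", "iy", "t"],
    "ten": ["t", "eh", "n"],
}


def differs_in_one_sound(a, b):
    """True when two equal-length phone sequences differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


def minimal_pairs(lexicon):
    """Return (word1, word2, (phone1, phone2)) for every minimal pair found."""
    pairs = []
    for (w1, p1), (w2, p2) in combinations(sorted(lexicon.items()), 2):
        if differs_in_one_sound(p1, p2):
            contrast = next((x, y) for x, y in zip(p1, p2) if x != y)
            pairs.append((w1, w2, contrast))
    return pairs


for w1, w2, (a, b) in minimal_pairs(LEXICON):
    print(f"{w1} / {w2}: {a} contrasts with {b}")
```

Each contrast reported this way is evidence that the two phones belong to different phonemes in the language the lexicon represents.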
3Overview of the Noisy Channel
The Noisy Channel
- Computational Linguistics
- Replace the ear with a microphone
- Replace the brain with a computer algorithm
4Production
- We have a complete but approximate model of how speech is produced
- We cannot accurately predict the audio signal corresponding to given articulatory positions
- The best synthesis methods, for now, use concatenation-based algorithms to create computerized speech
- Model: A pulmonic egressive air-stream flows from the source (glottis) through the vocal tract, which together operate as a source-filter system
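The source-filter idea can be illustrated with a short sketch. It is only a toy and not the concatenation-based method mentioned above. It makes several assumptions: the glottal source is an idealized impulse train, each vocal-tract resonance is a simple two-pole resonator, and the 80 Hz bandwidth and 16 kHz sample rate are arbitrary. The function names are hypothetical. The resonance frequencies 300 Hz and 2300 Hz are the F1 and F2 values listed for "iy" in the vowel table later in these notes.

```python
import math


def impulse_train(f0_hz, duration_s, sample_rate):
    """Idealized voiced source: one unit pulse per glottal period (1 / F0)."""
    n = int(duration_s * sample_rate)
    period = max(1, round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator standing in for one vocal-tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, below 1
    theta = 2.0 * math.pi * center_hz / sample_rate      # pole angle
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz, resonances_hz, duration_s=0.05, sample_rate=16000):
    """Pass the source through one resonator per resonance (the filter)."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for center in resonances_hz:
        signal = resonator(signal, center, bandwidth_hz=80.0, sample_rate=sample_rate)
    return signal


samples = synthesize_vowel(120.0, [300.0, 2300.0])
print(len(samples), "samples generated")
```

Changing `f0_hz` alters the pulse spacing (the pitch) without touching the resonances, which is the separation of source and filter that the model describes.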
5Vocal Source
- The speaker alters the tension of the vocal folds
- If the folds are open, speech is unvoiced and resembles background noise
- If the folds are stretched close together, speech is voiced
- Air pressure builds and blows the vocal folds open; releasing the pressure and elasticity cause the vocal folds to fall back
- Average fundamental frequency (F0): 60 Hz to 300 Hz
- The speaker's control of vocal tension alters F0 and the perceived pitch
[Figure: glottal cycle showing the open phase, the closed phase, and the period]
6Formants
- Definition: resonances of the vocal tract that emphasize particular harmonics of F0
- F1, F2, F3, etc.
- Adds timbre to voiced sounds
- Vowels have distinct harmonic patterns
- Vocal articulators change the emphasis of the harmonics and alter their frequencies
- There are complex relationships between formants dependent on vocal musculature
- Formants spread out as the pitch goes higher
7Formant Speaker Variance
8Vowel Formants
[Figure: vowel formant chart with regions labeled u, o, e, uh, eh, ih, ah, aw, ae]
9Vocal Tract
Note: The velum is the soft palate; the epiglottis guards (protects) the vocal cords
10Another look at the vocal tract
11Different Voices
- Falsetto: The vocal cords are stretched and become thin, causing a high frequency
- Creaky: Only the front vocal folds vibrate, giving a low frequency
- Breathy: The vocal cords vibrate, but air is escaping through the glottis
- Each person tends to consistently use particular phonation patterns. This makes the voice uniquely theirs.
12Vowels
No restriction of the vocal tract; the articulators alter the formants
- Diphthong: Syllabics which show a marked glide from one vowel to another, usually a steady vowel plus a glide
- Nasalized: Some air flows through the nasal cavity
- Rounding: Shape of the lips
- Tense: Sounds are more extreme (further from the schwa) and tend to have the tongue body higher
- Relaxed: Sounds are closer to schwa (tonally neutral)
- Tongue position: Front to back, high to low
13Vowel Characteristics
- Demo of vowel positions in the English language
- http://faculty.washington.edu/dillon/PhonResources/vowels.html
Vowel Word high low front back round tense F1 (Hz) F2 (Hz)
iy Feel - - - 300 2300
Ih Fill - - - - 360 2100
ae Gas - - - 750 1750
aa Father - - - - 680 1100
ah Cut - - - - - 720 1240
ao Dog - - - - - - 600 900
ax Comply - - - - - 720 1240
eh Pet - - - 570 1970
ow Tone - - - - 600 900
uh Good - - - 380 950
uw Tool 300 940
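The F1 and F2 columns are enough to sketch a simple vowel classifier: pick the table entry closest to a measured pair of formants. The sketch below assumes plain Euclidean distance in Hz, which is a simplification (a perceptual frequency scale would likely serve better), and the function name is hypothetical. The values are copied from the table above; "ax" and "ow" are left out because their listed values duplicate "ah" and "ao".

```python
import math

# (F1, F2) in Hz, copied from the vowel table above.
VOWEL_FORMANTS = {
    "iy": (300, 2300),
    "ih": (360, 2100),
    "ae": (750, 1750),
    "aa": (680, 1100),
    "ah": (720, 1240),
    "ao": (600, 900),
    "eh": (570, 1970),
    "uh": (380, 950),
    "uw": (300, 940),
}


def nearest_vowel(f1_hz, f2_hz):
    """Return the vowel label whose (F1, F2) entry is closest to the measurement."""
    return min(
        VOWEL_FORMANTS,
        key=lambda v: math.dist((f1_hz, f2_hz), VOWEL_FORMANTS[v]),
    )


print(nearest_vowel(310, 2250))
print(nearest_vowel(700, 1150))
```

Because formants vary between speakers, a real system would normalize per speaker before a comparison like this.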
14Consonants
- Significant obstruction in the nasal or oral cavities
- Occur in pairs or triplets and can be voiced or unvoiced
- Sonorant: continuous voicing
- Unvoiced: less energy
- Plosive: Period of silence and then a sudden energy burst
- Lateral, semi-vowels, retroflex: partial air flow block
- Fricatives, affricates: Turbulence in the waveform
15Manner of Articulation
- Voiced: The vocal cords are vibrating; Unvoiced: the vocal cords don't vibrate
- Obstruent: Noise-like sounds
- Fricative: Air flow is not completely shut off
- Affricate: A sequence of a stop followed by a fricative
- Sibilant: A consonant characterized by a hissing sound (like s or sh)
- Trill: A rapid vibration of one speech organ against another (Spanish r)
- Aspiration: Burst of air following a stop
- Stop: Air flow is cut off
- Ejective: The airstream and the glottis are closed and suddenly released (/p'/)
- Plosive: A stop followed by a sudden release
- Flap: A single, quick touch of the tongue (t in water)
- Nasality: Lowering the soft palate allows air to flow through the nose
- Glides: Vowel-like; syllable position makes them short and unstressed (w, y)
- On-glide: glide before a vowel; off-glide: glide after a vowel
- Approximant (semi-vowels): The active articulator approaches the passive articulator, but doesn't totally shut off the air flow (l and r)
- Laterality: The air flow proceeds around the side of the tongue
16Place of the Articulation
Articulation: Shaping the speech sounds
- Bilabial: The two lips (p, b, and m)
- Labio-dental: Lower lip and the upper teeth (v)
- Dental: Upper teeth and tongue tip or blade (thing)
- Alveolar: Alveolar ridge and tongue tip or blade (d, n, s)
- Post-alveolar: Area just behind the alveolar ridge and tongue tip or blade (jug dʒ, ship ʃ, chip tʃ, vision ʒ)
- Retroflex: Tongue curled and back (rolling r)
- Palatal: Tongue body touches the hard palate (j)
- Velar: Tongue body touches the soft palate (k, g, ŋ (thing))
- Glottal: Larynx (uh-uh, voiced h)
17English Consonants
Type              Phones                      Mechanism
Plosive           b, p, d, t, g, k            Close oral cavity
Nasal             m, n, ng                    Open nasal cavity
Fricative         v, f, z, s, dh, th, zh, sh  Turbulent
Affricate         jh, ch                      Stop + turbulent
Retroflex liquid  r                           Tongue high and curled
Lateral liquid    l                           Side airstreams
Glide             w, y                        Vowel-like
18Consonant Place and Manner
                    Labial  Labio-dental  Dental  Alveolar  Palatal  Velar  Glottal
Plosive             p b                           t d                k g    ?
Nasal               m                             n                  ng
Fricative                   f v           th dh   s z       sh zh           h
Retroflex sonorant                                r
Lateral sonorant                                  l
Glide               w                                       y
19Example word
20Speech Production Analysis
- Plate attached to the roof of the mouth measuring contact
- Collar around the neck measuring glottis vibrations
- Measure air flow from the mouth and nose
- Three-dimensional images using MRI
- Note: The IPA was designed before the above technologies existed. Its symbols were devised by linguists looking down someone's mouth or feeling how sounds are made.
21Perception
- Some perceptual components are understood, but knowledge concerning the entire human perception model is rudimentary
- Understood components
- The inner ear works as a filter bank
- Sounds are perceived on a logarithmic scale
- Some sounds will mask others
22The Inner Ear
- Two sensory organs are located in the inner ear.
- The vestibule is the organ of equilibrium.
- The cochlea is the organ of hearing.
23Basilar Membrane
Note: Basilar membrane shown unrolled
- Thin elastic fibers stretched across the cochlea
- Short, narrow, stiff, and closely packed near the oval window
- Long, wider, flexible, and sparse near the end of the cochlea
- The membrane connects to a ligament at its end
- Separates two liquid-filled tubes that run along the cochlea
- The fluids are very different chemically and carry the pressure waves
- A leakage between the two tubes causes a hearing breakdown
- Provides a base for sensory hair cells
- The hair cells above the resonating region fire more profusely
- The fibers vibrate like the strings of a musical instrument
24Place Theory
Decomposing the sound spectrum
- Georg von Békésy's Nobel Prize discovery
- High frequencies excite the narrow, stiff part near the oval window
- Low frequencies excite the wide, flexible part by the apex
- Auditory nerve input
- Hair cells on the basilar membrane fire near the vibrations
- The auditory nerve receives frequency-coded neural signals
- A large frequency range is possible because the basilar membrane's stiffness varies exponentially
Demo at http://www.blackwellpublishing.com/matthews/ear.html
25Hair Cells
- The hair cells are in rows along the basilar membrane.
- Individual hair cells have multiple strands, or stereocilia.
- The sensitive hair cells have many tiny stereocilia which form a conical bundle in the resting state
- Pressure variations cause the stereocilia to dance wildly and send electrical impulses to the brain.
26Firing of Hair Cells
- There is a voltage difference across the cell
- The stereocilia project into the endolymph fluid (60 mV)
- The perilymph fluid surrounds the membrane of the hair cells (-70 mV)
- When the hair cells move
- The potential difference increases
- The cells fire
27Speech Perception
- We don't perceive speech linearly
- The cochlea has rows of hair cells. Each row acts as a frequency filter.
- The frequency filters overlap
From early place theory experiments
28Absolute Hearing Threshold
- The hearing threshold varies at different frequencies.
- An empirical formula approximates the SPL threshold (f in Hz, result in dB SPL):
  $$\mathrm{SPL}(f) = 3.65\,(f/1000)^{-0.8} - 6.5\,e^{-0.6\,(f/1000 - 3.3)^{2}} + 10^{-3}\,(f/1000)^{4}$$
- The table measures the threshold for men (M) and women (W) ages 20 through 60
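The empirical threshold formula can be evaluated directly. The sketch below assumes the formula reads SPL(f) = 3.65 (f/1000)^-0.8 - 6.5 exp(-0.6 (f/1000 - 3.3)^2) + 10^-3 (f/1000)^4, with f in Hz and the result in dB SPL. The sign of the last term and the placement of the exponents are a reconstruction, and the function name is hypothetical.

```python
import math


def hearing_threshold_db_spl(f_hz):
    """Approximate absolute hearing threshold in dB SPL at frequency f_hz (Hz)."""
    khz = f_hz / 1000.0
    return (
        3.65 * khz ** -0.8
        - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
        + 1e-3 * khz ** 4
    )


for f in (100, 1000, 3300, 10000):
    print(f"{f:>6} Hz: {hearing_threshold_db_spl(f):6.2f} dB SPL")
```

Under this reading the threshold is lowest a little above 3 kHz and rises steeply toward both very low and very high frequencies.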
29Sound Threshold Measurements
30Intensity and Neural Response
- Auditory response is a function of intensity
- The response saturates at a maximum intensity
level
From CMU Robust Speech Group
31Bark and Mel Scales
32Comparison of Frequency Perception Scales
- Blue: Bark scale
- Red: Mel scale
- Green: ERB scale
Equivalent Rectangular Bandwidth (ERB) is an
unrealistic but simple rectangular approximation
to model the filters in the cochlea
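The three scales can be compared numerically. The slides do not give formulas, so the sketch below uses commonly cited approximations as an assumption: the 2595 log10(1 + f/700) form for mel, the Zwicker and Terhardt arctangent form for Bark, and the Glasberg and Moore form for the ERB bandwidth. The function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz (common 2595 * log10 approximation)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def hz_to_bark(f_hz):
    """Bark value for a frequency in Hz (Zwicker and Terhardt approximation)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def erb_bandwidth_hz(f_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)


for f in (250, 1000, 4000):
    print(f, round(hz_to_mel(f), 1), round(hz_to_bark(f), 2), round(erb_bandwidth_hz(f), 1))
```

All three grow roughly linearly at low frequencies and compress at high frequencies, consistent with the logarithmic perception of sound noted earlier.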
33Masking
- Masking is a phenomenon in which perception of one sound is obscured by the presence of another sound
- Masking occurs in both the time and frequency domains
- Time: One tone occurs shortly before another tone
- Frequency: One tone is near the frequency of another
- Experiment (most involve single sine waves)
- Fix one sound at a frequency and intensity
- Vary a second sine wave's intensity
- When is the second sound heard?
- Amplification of perception
- Tones below the threshold of hearing can be perceived if they occur simultaneously and the total energy within a frequency band exceeds the threshold.
34Masking Patterns
- A narrow band of noise at 410 Hz
- Note the asymmetrical pattern
From CMU Robust Speech Group
35Time Domain Masking
- Noise will mask a tone if
- The noise is sufficiently loud
- The delay is short
- The intensity of the noise needs to increase with the delay length
- There are two types of masking
- Forward: Noise masks a tone that follows
- Backward: A tone is masked by noise that follows
- Delays
- Beyond 100 to 200 ms, no forward masking occurs
- Beyond 20 ms, no backward masking occurs. Training can reduce or eliminate the perceived backward masking.
36Phonology
- Study of sound combinations
- Rule-based
- A finite state grammar can represent valid sound combinations in a language
- Unfortunately, these rules are language-specific
- Statistics-based
- Most other areas of Natural Language Processing are trending toward statistics-based methods
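A finite state grammar for sound combinations can be sketched as a small transition table. The example below is a toy under stated assumptions: the phone inventory is hypothetical, and the grammar only checks the overall shape of a syllable (up to two onset consonants, one vowel, up to two coda consonants). It does not check which specific consonants may combine, which a real language-specific grammar would need.

```python
# Hypothetical toy inventory (ARPABET-style labels).
CONSONANTS = {"p", "b", "t", "d", "k", "g", "s", "n", "m", "l", "r"}
VOWELS = {"iy", "ih", "eh", "ae", "aa", "uw"}

# (current state, phone class) -> next state
TRANSITIONS = {
    ("start", "C"): "onset1",
    ("start", "V"): "nucleus",
    ("onset1", "C"): "onset2",
    ("onset1", "V"): "nucleus",
    ("onset2", "V"): "nucleus",
    ("nucleus", "C"): "coda1",
    ("coda1", "C"): "coda2",
}
ACCEPTING = {"nucleus", "coda1", "coda2"}


def phone_class(phone):
    """Map a phone to 'V', 'C', or None when it is not in the inventory."""
    if phone in VOWELS:
        return "V"
    if phone in CONSONANTS:
        return "C"
    return None


def is_valid_syllable(phones):
    """Run the finite state machine over the phones and test for acceptance."""
    state = "start"
    for phone in phones:
        state = TRANSITIONS.get((state, phone_class(phone)))
        if state is None:
            return False
    return state in ACCEPTING


print(is_valid_syllable(["s", "t", "iy", "p"]))
print(is_valid_syllable(["s", "t", "r", "iy"]))
```

The table itself is the language-specific part: another language would need a different set of states and transitions.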
37Syllables
- Organizational phonological unit
- Vowel between two consonants
- Ambiguous positioning of consonants into syllables
- Tree-structured representation
- Basic unit of prosody
- Lexical stress: inherent property of a word
- Sentential stress: speaker's choice to emphasize or clarify
38Representing Stress
- There have been unsuccessful attempts to automatically assign stress to phonemes
- Notations for representing stress
- IPA (International Phonetic Alphabet) has a diacritic symbol for stress
- Numeric representation
- 0: reduced, 1: normal, 2: stressed
- Relative
- Reduced (R) or Stressed (S)
- No notation means undistinguished
39Phonological Grammars
- SPE: The Sound Pattern of English
- 13 binary features for 2^13 = 8192 combinations
- Complete descriptive grammar
- Recent research
- Trend towards context-sensitive descriptions
- Little thought concerning computational feasibility
- It's unlikely that listeners apply thousands of rules to perceive speech
40Morphology
- How phonemes combine to make words
- Important for speech synthesis
- Example: singular to plural
- Run to runs: z sound (voiced)
- Hit to hits: s sound (unvoiced)
- Devise sets of rules of pronunciation
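A pronunciation rule of this kind can be written as a small function over phone sequences. The sketch below covers the two cases in the example (voiced ending takes z, unvoiced ending takes s) and adds a third case for endings in a sibilant (as in "roses"), which goes beyond the slide and is included as an assumption. The phone sets are simplified ARPABET-style labels and the function names are hypothetical.

```python
# Simplified, hypothetical phone classes (ARPABET-style labels).
UNVOICED_FINALS = {"p", "t", "k", "f", "th"}
SIBILANTS = {"s", "z", "sh", "zh", "ch", "jh"}


def plural_suffix(final_phone):
    """Choose the plural ending from the last phone of the singular form."""
    if final_phone in SIBILANTS:
        return ["ix", "z"]  # extra vowel after a sibilant (assumption beyond the slide)
    if final_phone in UNVOICED_FINALS:
        return ["s"]        # unvoiced ending, as in "hits"
    return ["z"]            # voiced ending, as in "runs"


def pluralize(phones):
    """Return the phone sequence of the plural form."""
    return phones + plural_suffix(phones[-1])


print(pluralize(["r", "ah", "n"]))   # run
print(pluralize(["hh", "ih", "t"]))  # hit
print(pluralize(["r", "ow", "z"]))   # rose
```

A speech synthesizer needs rules like this because the spelling "s" alone does not say which sound to produce.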
41Orthography Writing Systems
- Diacritics: Accent marks
- Prosody: Stress, loudness, pitch, tone, intonation, and length
- Written symbolic representation of speech
- Wide: symbol set representing a speech message
- Narrow: symbol set representing a speech signal
- English-based phonetic transcriptions: ARPABET, TIMIT
- IPA: International Phonetic Alphabet
- International standard attempt at a narrow transcription
- Intent: represent all sounds of known languages
- Disadvantages
- Misses articulator interrelationships
- Multiple realizations of the same sound
- Non-linearity of speech; articulators are always moving
42Narrow transcription Difficulties
- Realizations are points in continuous space, not discrete
- Sounds take on characteristics of adjacent sounds (assimilation)
- Sounds that are combinations of two (co-articulation)
- Articulator targets are often not reached
- Diphthongs combine different phonemes
- Adding (epenthesis) or deleting (elision) sounds
- Missing word and phrase boundaries and endings
- Many tonal variations during speech
- Varied vowel durations
- Common knowledge and familiar background lead to sloppier speech with additional non-linearities
43Written English
- Spellings are not consistent with regard to sounds
- Same spelling, different sounds: low vs. cow
- Different spelling, same sounds: cow, bough
- Pronunciations of written languages evolve over time
- If current written English were phonetically accurate
- It would only apply to a single dialect
- It would be wrong as soon as the population altered its speech patterns
44George Bernard Shaw's System
His goal: Replace the Latin alphabet with one that is phonetically accurate
Result: It didn't work. Language phonetics are not static, and the population was not willing to switch to a new writing system
45Pitman Shorthand
46ARPABET English-based phonetic system
Phone  Example  Phone  Example  Phone  Example
iy     beat     b      bet      p      pet
ih     bit      ch     chet     r      rat
eh     bet      d      debt     s      set
ah     but      f      fat      sh     shoe
ae     bat      g      get      t      ten
ao     bought   hh     hat      th     thick
ow     boat     hy     high     dh     that
uh     book     jh     jet      dx     butter
ey     bait     k      kick     v      vet
er     bert     l      let      w      wet
ay     buy      m      met      wh     which
oy     boy      em     bottom
axr    dinner   n      net      y      yet
aw     down     en     button   z      zoo
ax     about    ng     sing     zh     measure
ix     roses    eng    washing
aa     cot      -      silence
47The International Phonetic Alphabet
48IPA Vowels
Caution: English tongue positions don't exactly match the chart. For example, father in English does not have the tongue position as far back as the IPA vowel chart shows.
49IPA Diacritics
50IPA Tones and Word Accents
51IPA Supra-segmental Symbols
52Newer Technologies
- VoiceXML
- Framework for integrating human/machine dialogues
- W3C (World Wide Web Consortium) standard
- Input: audio files or human speech
- Output: synthesized speech
- Script interpreted by voice browsers
- SSML (Speech Synthesis Markup Language)
- XML-based technology to standardize manipulation of synthesized speech
- Others
- SABLE (1998 Consortium)
- SAPI (Microsoft Speech API)
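To give a feel for SSML, the sketch below builds a small SSML document with Python's standard XML library. The `speak`, `emphasis`, and `break` elements come from the W3C SSML specification. The helper name, the sample sentence, and the 300 ms pause are hypothetical choices for illustration, and no particular speech engine is assumed.

```python
import xml.etree.ElementTree as ET


def build_ssml(text_before, emphasized, text_after, pause_ms=300):
    """Return an SSML string with one emphasized word followed by a pause."""
    speak = ET.Element(
        "speak",
        {
            "version": "1.0",
            "xmlns": "http://www.w3.org/2001/10/synthesis",
            "xml:lang": "en-US",
        },
    )
    speak.text = text_before + " "
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = emphasized
    pause = ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    pause.tail = " " + text_after  # text that follows the pause
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Phonetics is", "fun", "to study.")
print(ssml)
```

The markup controls how the text is spoken (emphasis, pauses) without changing the words themselves, which is the kind of manipulation SSML is meant to standardize.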