Phonetics: Speech production and perception

Slides: 53
Provided by: Harv55
Learn more at: http://cs.sou.edu

Transcript and Presenter's Notes


1
Introduction
  • Phonetics: Speech production and perception
  • Phonology: Study of sound combinations
  • Orthography: Writing systems
  • We'll talk about each area and how it impacts
    Natural Language Processing

2
Phonetics
Study of speech production and perception
  • Phones: The set of all sounds that humans can
    articulate
  • Phoneme: Distinct family of phones in a language
  • Languages utilize 15 to 40 phonemes
  • Note: Too few distinct sounds for a language's
    vocabulary
  • Ears are tuned to hear a language's distinct
    phonemes
  • Languages are easy to speak while still being
    understood
  • Infer the phoneme set: find words differing in
    only one sound (minimal pairs)
  • Allophone: Variant realizations of a phoneme
  • Can be separate phonemes in other languages
  • Segment: All phones, phonemes, and allophones
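
The "find words differing in only one sound" test can be sketched in a few lines. This is a minimal illustration, not a real phoneme-discovery tool: the lexicon below is a hypothetical toy word list with ARPABET-style transcriptions, and only same-length substitutions are compared.

```python
from itertools import combinations

# Hypothetical toy lexicon: word -> phoneme sequence (ARPABET-style symbols).
LEXICON = {
    "pet": ("p", "eh", "t"),
    "bet": ("b", "eh", "t"),
    "bit": ("b", "ih", "t"),
    "beat": ("b", "iy", "t"),
    "ten": ("t", "eh", "n"),
}


def differs_in_one_sound(a, b):
    """True if two equal-length phoneme sequences differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def minimal_pairs(lexicon):
    """Return (word1, word2, sound1, sound2) for every minimal pair in the lexicon."""
    pairs = []
    for (w1, p1), (w2, p2) in combinations(sorted(lexicon.items()), 2):
        if differs_in_one_sound(p1, p2):
            s1, s2 = next((x, y) for x, y in zip(p1, p2) if x != y)
            pairs.append((w1, w2, s1, s2))
    return pairs


def inferred_phonemes(lexicon):
    """Sounds that distinguish at least one minimal pair are treated as contrastive."""
    sounds = set()
    for _, _, s1, s2 in minimal_pairs(lexicon):
        sounds.update((s1, s2))
    return sounds


for pair in minimal_pairs(LEXICON):
    print(pair)
```

Sounds that never separate two words in this way would be candidates for allophones of a single phoneme.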

3
Overview of the Noisy Channel
The Noisy Channel
  • Computational Linguistics
  • Replace the ear with a microphone
  • Replace the brain with a computer algorithm

4
Production
  • We have a complete but approximate model of how
    speech is produced
  • We cannot accurately predict the audio signal
    corresponding to given articulatory positions
  • The best synthesis methods, for now, use
    concatenation-based algorithms to create
    computerized speech.
  • Model: Pulmonic egressive air stream from the
    source (glottis) through the vocal tract,
    operating as a source-filter system.
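
A rough sketch of the source-filter idea follows. It is only an illustration of the model, not the concatenation-based synthesis mentioned above. The assumptions are mine: the glottal source is reduced to an impulse train, each formant is a two-pole resonator with a 100 Hz bandwidth, and the sample rate is 16 kHz.

```python
import math


def impulse_train(f0_hz, duration_s, sample_rate):
    """Glottal source stand-in: one unit impulse per pitch period."""
    period = round(sample_rate / f0_hz)
    n = round(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, below 1
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / sample_rate)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz, formants_hz, duration_s=0.05,
                     sample_rate=16000, bandwidth_hz=100.0):
    """Source (impulse train) passed through one resonator per formant (filter)."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for center in formants_hz:
        signal = resonator(signal, center, bandwidth_hz, sample_rate)
    return signal


# F0 of 120 Hz shaped by two illustrative formants at 300 Hz and 2300 Hz.
samples = synthesize_vowel(120.0, (300.0, 2300.0))
print(len(samples), "samples")
```

Changing `f0_hz` alters the pitch without moving the resonances, while changing `formants_hz` alters the vowel quality without changing the pitch.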

5
Vocal Source
  • The speaker alters the tension of the vocal folds
  • If the folds are open, speech is unvoiced,
    resembling background noise
  • If the folds are stretched close, speech is voiced
  • Air pressure builds and the vocal folds blow open,
    releasing pressure; elasticity causes the
    vocal folds to fall back
  • Average fundamental frequency (F0): 60 Hz to 300
    Hz
  • Speaker control of vocal tension alters F0 and the
    perceived pitch
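
F0 is the reciprocal of the glottal period, so it can be estimated by finding the lag at which a voiced signal best matches a shifted copy of itself. The sketch below uses plain autocorrelation restricted to the 60 Hz to 300 Hz range given above. The test signal is a synthetic 100 Hz sine, an assumption standing in for real voiced speech.

```python
import math


def estimate_f0(samples, sample_rate, f0_min=60.0, f0_max=300.0):
    """Return sample_rate / lag for the lag with the highest autocorrelation."""
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 100.0 * n / sample_rate) for n in range(1600)]
f0 = estimate_f0(tone, sample_rate)
print(f0)
```

Real speech needs more care (windowing, voicing decisions, octave errors), so treat this as the bare idea only.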

[Figure: glottal cycle showing the open phase, the closed phase, and the period]
6
Formants
  • Definition: Resonances of the vocal tract that
    emphasize particular harmonics of F0
  • F1, F2, F3, etc.
  • Add timbre to voiced sounds
  • Vowels have distinct formant patterns
  • Vocal articulators shift the formant frequencies,
    changing which harmonics are emphasized
  • There are complex relationships between formants
    dependent on vocal musculature
  • Harmonics spread out as the pitch goes higher

7
Formant Speaker Variance
8
Vowel Formants
[Figure: vowel formant chart with points labeled u, o, e, uh, eh, ih, ah, aw, ae]
9
Vocal Tract
Note: The velum is the soft palate; the epiglottis
protects the vocal cords
10
Another look at the vocal tract
11
Different Voices
  • Falsetto: The vocal cords are stretched and
    become thin, causing a high frequency
  • Creaky: Only the front part of the vocal folds
    vibrates, giving a low frequency
  • Breathy: The vocal cords vibrate, but air is
    escaping through the glottis
  • Each person tends to consistently use particular
    phonation patterns. This makes the voice uniquely
    theirs.

12
Vowels
No restriction of the vocal tract, articulators
alter the formants
  • Diphthong: Syllabics which show a marked glide
    from one vowel to another, usually a steady vowel
    plus a glide
  • Nasalized: Some air flows through the nasal cavity
  • Rounding: Shape of the lips
  • Tense: Sounds more extreme (further from the
    schwa) that tend to have the tongue body higher
  • Relaxed: Sounds closer to schwa (tonally neutral)
  • Tongue position: Front to back, high to low

13
Vowel Characteristics
  • Demo of vowel positions in the English language:
    http://faculty.washington.edu/dillon/PhonResources/vowels.html

Vowel  Word    F1 (Hz)  F2 (Hz)
iy     Feel    300      2300
ih     Fill    360      2100
ae     Gas     750      1750
aa     Father  680      1100
ah     Cut     720      1240
ao     Dog     600      900
ax     Comply  720      1240
eh     Pet     570      1970
ow     Tone    600      900
uh     Good    380      950
uw     Tool    300      940
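
One way to use the F1 and F2 values above is a nearest-neighbour lookup: given measured formants, pick the vowel whose table entry is closest. This is a hedged sketch only. Plain Euclidean distance in Hz is my assumption, and real systems normalize for speaker differences. ax and ow are left out because the table gives them the same values as ah and ao.

```python
import math

# (F1, F2) in Hz, copied from the table above.
VOWEL_FORMANTS = {
    "iy": (300, 2300),
    "ih": (360, 2100),
    "ae": (750, 1750),
    "aa": (680, 1100),
    "ah": (720, 1240),
    "ao": (600, 900),
    "eh": (570, 1970),
    "uh": (380, 950),
    "uw": (300, 940),
}


def nearest_vowel(f1_hz, f2_hz):
    """Return the vowel label whose (F1, F2) entry is closest to the measurement."""
    return min(
        VOWEL_FORMANTS,
        key=lambda v: math.dist((f1_hz, f2_hz), VOWEL_FORMANTS[v]),
    )


print(nearest_vowel(310, 2250))
print(nearest_vowel(700, 1150))
```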
14
Consonants
  • Significant obstruction in the nasal or oral
    cavities
  • Occur in pairs or triplets and can be voiced or
    unvoiced
  • Sonorant: Continuous voicing
  • Unvoiced: Less energy
  • Plosive: Period of silence and then a sudden
    energy burst
  • Lateral, semi-vowels, retroflex: Partial air flow
    block
  • Fricatives, affricates: Turbulence in the
    waveform

15
Manner of Articulation
  • Voiced: The vocal cords are vibrating. Unvoiced:
    The vocal cords don't vibrate
  • Obstruent: Noise-like sounds
  • Fricative: Air flow not completely shut off
  • Affricate: A sequence of a stop followed by a
    fricative
  • Sibilant: A consonant characterized by a hissing
    sound (like s or sh)
  • Trill: A rapid vibration of one speech organ
    against another (Spanish r).
  • Aspiration: Burst of air following a stop.
  • Stop: Air flow is cut off
  • Ejective: The airstream and the glottis are closed
    and suddenly released (/p'/).
  • Plosive: Stop followed by a sudden release
  • Flap: A single, quick touch of the tongue (t in
    water).
  • Nasality: Lowering the soft palate allows air to
    flow through the nose
  • Glides: Vowel-like; syllable position makes them
    short and unstressed (w, y)
  • On-glide: Glide before a vowel. Off-glide: Glide
    after a vowel
  • Approximant (semi-vowel): The active articulator
    approaches the passive articulator, but doesn't
    totally shut off the air flow (l and r).
  • Laterality: The air flow proceeds around the side
    of the tongue

16
Place of the Articulation
Articulation: Shaping the speech sounds
  • Bilabial: The two lips (p, b, and m)
  • Labio-dental: Lower lip and the upper teeth (v)
  • Dental: Upper teeth and tongue tip or blade
    (th in thing)
  • Alveolar: Alveolar ridge and tongue tip or blade
    (d, n, s)
  • Post-alveolar: Area just behind the alveolar
    ridge and tongue tip or blade (jug /dʒ/, ship /ʃ/,
    chip /tʃ/, vision /ʒ/)
  • Retroflex: Tongue curled and back (rolling r)
  • Palatal: Tongue body touches the hard palate
    (j)
  • Velar: Tongue body touches the soft palate (k, g,
    ŋ as in thing)
  • Glottal: Larynx (uh-uh, voiced h)

17
English Consonants
Type              Phones                       Mechanism
Plosive           b, p, d, t, g, k             Close oral cavity
Nasal             m, n, ng                     Open nasal cavity
Fricative         v, f, z, s, dh, th, zh, sh   Turbulent
Affricate         jh, ch                       Stop + turbulent
Retroflex liquid  r                            Tongue high and curled
Lateral liquid    l                            Side airstreams
Glide             w, y                         Vowel-like
18
Consonant Place and Manner
                    Labial  Labio-dental  Dental  Alveolar  Palatal  Velar  Glottal
Plosive             p b                           t d                k g    ?
Nasal               m                             n                  ng
Fricative                   f v           th dh   s z       sh zh           h
Retroflex sonorant                                r
Lateral sonorant                                  l
Glide               w                                       y
19
Example word
20
Speech Production Analysis
  • Plate attached to roof of mouth measuring contact
  • Collar around the neck measuring glottis
    vibrations
  • Measure air flow from mouth and nose
  • Three-dimensional images using MRI
  • Note: The IPA was designed before the above
    technologies existed. It was devised by
    linguists looking down someone's mouth or feeling
    how sounds are made.

21
Perception
  • Some perceptual components are understood, but
    knowledge concerning the entire human perception
    model is rudimentary
  • Understood components:
  • The inner ear works as a filter bank
  • Sounds are perceived on a logarithmic scale
  • Some sounds will mask others

22
The Inner Ear
  • Two sensory organs are located in the inner ear.
  • The vestibule is the organ of equilibrium.
  • The cochlea is the organ of hearing.

23
Basilar Membrane
Note: Basilar membrane shown unrolled
  • Thin elastic fibers stretched across the cochlea
  • Short, narrow, stiff, and closely packed near the
    oval window
  • Long, wide, flexible, and sparse near the end of
    the cochlea
  • The membrane connects to a ligament at its end.
  • Separates two liquid filled tubes that run along
    the cochlea
  • The fluids are very different chemically and
    carry the pressure waves
  • A leakage between the two tubes causes a hearing
    breakdown
  • Provides a base for sensory hair cells
  • The hair cells above the resonating region fire
    more profusely
  • The fibers vibrate like the strings of a musical
    instrument.

24
Place Theory
Decomposing the sound spectrum
  • Georg von Békésy's Nobel Prize discovery
  • High frequencies excite the narrow, stiff part
    near the oval window
  • Low frequencies excite the wide, flexible part by
    the apex
  • Auditory nerve input
  • Hair cells on the basilar membrane fire near the
    vibrations
  • The auditory nerve receives frequency-coded
    neural signals
  • A large frequency range is possible because the
    basilar membrane's stiffness varies exponentially

Demo at http://www.blackwellpublishing.com/matthews/ear.html
25
Hair Cells
  • The hair cells are in rows along the basilar
    membrane.
  • Individual hair cells have multiple strands or
    stereocilia.
  • The sensitive hair cells have many tiny
    stereocilia which form a conical bundle in the
    resting state
  • Pressure variations cause the stereocilia to
    dance wildly and send electrical impulses to
    the brain.

26
Firing of Hair Cells
  • There is a voltage difference across the cell
  • The stereocilia project into the endolymph fluid
    (60 mV)
  • The perilymph fluid surrounds the membrane of the
    hair cells (-70 mV)
  • When the hair cells move:
  • The potential difference increases
  • The cells fire

27
Speech Perception
  • We don't perceive speech linearly
  • The cochlea has rows of hair cells. Each row acts
    as a frequency filter.
  • The frequency filters overlap

From early place theory experiments
28
Absolute Hearing Threshold
  • The hearing threshold varies at different
    frequencies.
  • An empirical formula approximates the SPL
    threshold (f in Hz, result in dB SPL):
    SPL(f) = 3.65 (f/1000)^{-0.8}
             - 6.5 e^{-0.6 (f/1000 - 3.3)^2}
             + 10^{-3} (f/1000)^4
  • The table measures the threshold for men (M) and
    women (W) ages 20 through 60
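
The empirical threshold formula on this slide is easy to evaluate directly. The sketch below assumes the usual reading of the formula (a negative Gaussian dip near 3.3 kHz and a positive fourth-power term) and uses the slide's leading constant of 3.65. The commonly cited form of this approximation uses 3.64, which changes the result only marginally.

```python
import math


def hearing_threshold_db_spl(freq_hz):
    """Approximate absolute hearing threshold in dB SPL for a frequency in Hz."""
    khz = freq_hz / 1000.0
    return (
        3.65 * khz ** -0.8
        - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
        + 1e-3 * khz ** 4
    )


for f in (100, 1000, 3300, 10000):
    print(f, round(hearing_threshold_db_spl(f), 2))
```

The minimum falls in the 3 kHz to 4 kHz region, where the ear is most sensitive, and the threshold rises steeply at both very low and very high frequencies.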

29
Sound Threshold Measurements
30
Intensity and Neural Response
  • Auditory response is a function of intensity
  • The response saturates at a maximum intensity
    level

From CMU Robust Speech Group
31
Bark and Mel Scales
32
Comparison of Frequency Perception Scales
  • Blue: Bark scale
  • Red: Mel scale
  • Green: ERB scale

Equivalent Rectangular Bandwidth (ERB) is an
unrealistic but simple rectangular approximation
to model the filters in the cochlea
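
For reference, here are commonly used closed-form approximations of the three scales. The slides do not state which formulas were plotted, so these are assumptions: the 2595 log10 form for mel, the Zwicker and Terhardt arctangent form for Bark, and the Glasberg and Moore expression for the ERB bandwidth.

```python
import math


def hz_to_mel(freq_hz):
    """Mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def hz_to_bark(freq_hz):
    """Bark scale (critical-band rate), arctangent approximation."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))


def erb_bandwidth_hz(freq_hz):
    """Equivalent rectangular bandwidth of the auditory filter centered at freq_hz."""
    return 24.7 * (4.37 * freq_hz / 1000.0 + 1.0)


for f in (100, 1000, 4000):
    print(f, round(hz_to_mel(f)), round(hz_to_bark(f), 2), round(erb_bandwidth_hz(f), 1))
```

All three compress high frequencies relative to low ones, which matches the logarithmic perception described earlier.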
33
Masking
  • Masking is a phenomenon in which perception of
    one sound is obscured by the presence of another
    sound
  • Masking occurs in both the time and frequency
    domains
  • Time: One tone occurs shortly before another tone
  • Frequency: One tone is near the frequency of
    another
  • Experiment (most involve single sine waves)
  • Fix one sound at a frequency and intensity
  • Vary a second sine wave's intensity
  • When is the second sound heard?
  • Amplification of perception
  • Tones below the threshold of hearing can be
    perceived if they occur simultaneously and the
    total energy within a frequency band exceeds the
    threshold.

34
Masking Patterns
  • A narrow band of noise at 410 Hz
  • Note the asymmetrical pattern

From CMU Robust Speech Group
35
Time Domain Masking
  • Noise will mask a tone if:
  • The noise is sufficiently loud
  • The delay is short
  • Intensity of the noise needs to increase with the
    delay length
  • There are two types of masking
  • Forward: Noise masks a tone that follows
  • Backward: A tone is masked by noise that follows
  • Delays
  • Beyond 100 to 200 ms, no forward masking occurs
  • Beyond 20 ms, no backward masking occurs.
    Training can reduce or eliminate the perceived
    backward masking.
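
The delay limits above can be turned into a small decision helper. This is a simplification under stated assumptions: it looks only at timing and ignores the intensity condition, it takes 200 ms (the upper end of the 100 to 200 ms range) as the forward limit, and the function name and return labels are my own.

```python
FORWARD_LIMIT_MS = 200.0   # assumption: upper end of the 100-200 ms range
BACKWARD_LIMIT_MS = 20.0   # beyond this, no backward masking occurs


def masking_type(noise_ms, tone_ms):
    """Classify possible masking of a tone by noise from their onset times in ms.

    Returns "forward", "backward", "simultaneous", or None if the delay is
    too long for masking to occur.
    """
    delay = tone_ms - noise_ms
    if delay == 0:
        return "simultaneous"
    if 0 < delay <= FORWARD_LIMIT_MS:
        return "forward"    # the tone follows the noise
    if -BACKWARD_LIMIT_MS <= delay < 0:
        return "backward"   # the noise follows the tone
    return None


print(masking_type(0, 50))
print(masking_type(10, 0))
print(masking_type(0, 250))
```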

36
Phonology
  • Study of sound combinations
  • Rule based
  • A finite state grammar can represent valid sound
    combinations in a language
  • Unfortunately, these rules are language-specific
  • Statistics based
  • Most other areas of natural language processing
    are trending toward statistics-based methods
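
As a toy illustration of the rule-based approach, a finite-state acceptor can check whether a phone sequence forms a valid syllable. The grammar here is hypothetical and far simpler than any real language: at most two onset consonants, exactly one vowel, and at most one coda consonant.

```python
# Hypothetical vowel inventory (ARPABET-style); everything else counts as a consonant.
VOWELS = {"iy", "ih", "eh", "ae", "aa", "ah", "ao", "uh", "uw"}

# (state, symbol class) -> next state, where C = consonant and V = vowel.
TRANSITIONS = {
    ("start", "C"): "onset1",
    ("start", "V"): "nucleus",
    ("onset1", "C"): "onset2",
    ("onset1", "V"): "nucleus",
    ("onset2", "V"): "nucleus",
    ("nucleus", "C"): "coda",
}
ACCEPTING = {"nucleus", "coda"}


def is_valid_syllable(phones):
    """Run the acceptor over the phones and report whether it ends in an accepting state."""
    state = "start"
    for phone in phones:
        symbol = "V" if phone in VOWELS else "C"
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False
    return state in ACCEPTING


print(is_valid_syllable(["s", "t", "aa", "p"]))
print(is_valid_syllable(["s", "t", "r", "iy"]))
```

Because the transition table encodes one language's constraints, a different language needs a different table, which is the language-specific drawback noted above.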

37
Syllables
  • Organizational phonological unit
  • Vowel between two consonants
  • Ambiguous positioning of consonants into
    syllables
  • Tree structured representation
  • Basic unit of prosody
  • Lexical stress: Inherent property of a word
  • Sentential stress: Speaker choice to emphasize or
    clarify

38
Representing Stress
  • There have been unsuccessful attempts to
    automatically assign stress to phonemes
  • Notations for representing stress
  • IPA (International Phonetic Alphabet) has a
    diacritic symbol for stress
  • Numeric representation
  • 0: reduced, 1: normal, 2: stressed
  • Relative
  • Reduced (R) or Stressed (S)
  • No notation means undistinguished

39
Phonological Grammars
  • SPE: The Sound Pattern of English
  • 13 binary features for 2^13 = 8192 combinations
  • Complete descriptive grammar
  • Recent research
  • Trend towards context-sensitive descriptions
  • Little thought concerning computational
    feasibility
  • It's unlikely that listeners apply thousands of
    rules to perceive speech

40
Morphology
  • How phonemes combine to make words
  • Important for speech synthesis
  • Example: singular to plural
  • Run to runs: z sound (voiced)
  • Hit to hits: s sound (unvoiced)
  • Devise sets of rules of pronunciation
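
The plural example can be written as a small pronunciation rule keyed on the final sound of the singular. This is a minimal sketch: the phone sets are illustrative ARPABET-style lists that I chose, and the third case (an extra vowel after sibilants, as in "buses") is a standard addition that the slide does not list.

```python
# Illustrative ARPABET-style phone classes (assumed, not exhaustive).
VOICELESS = {"p", "t", "k", "f", "th"}
SIBILANTS = {"s", "z", "sh", "zh", "ch", "jh"}


def plural_suffix(last_phone):
    """Return the phones of the plural ending for a word ending in last_phone."""
    if last_phone in SIBILANTS:
        return ["ix", "z"]   # bus -> buses
    if last_phone in VOICELESS:
        return ["s"]         # hit -> hits (unvoiced)
    return ["z"]             # run -> runs (voiced)


print(plural_suffix("n"))
print(plural_suffix("t"))
print(plural_suffix("s"))
```

The sibilant check must come first, because s, sh, and ch are also voiceless and would otherwise receive a plain s.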

41
Orthography Writing Systems
  • Diacritics: Accent marks
  • Prosody: Stress, loudness, pitch, tone,
    intonation, and length
  • Written symbolic representation of speech
  • Wide: Symbol set representing a speech message
  • Narrow: Symbol set representing a speech signal
  • English-based phonetic transcriptions: ARPABET,
    TIMIT
  • IPA: International Phonetic Alphabet
  • International standard attempt at a narrow
    transcription
  • Intent: Represent all sounds of known languages
  • Disadvantages
  • Misses articulator interrelationships
  • Multiple realizations of the same sound
  • Non-linearity of speech, articulators always
    moving

42
Narrow transcription Difficulties
  • Realizations are points in continuous space, not
    discrete
  • Sounds take characteristics of adjacent sounds
    (assimilation)
  • Sounds that are combinations of two
    (co-articulation)
  • Articulator targets are often not reached
  • Diphthongs combine different phonemes
  • Adding (epenthesis) or deleting (elision) sounds
  • Missing word and phrase boundaries, endings
  • Many tonal variations during speech
  • Varied vowel durations
  • Common knowledge and a familiar background lead
    to sloppier speech with additional
    non-linearities.

43
Written English
  • Spellings are not consistent with regard to
    sounds
  • Same spelling, different sounds: low vs. cow
  • Different spelling, same sounds: cow, bough
  • Pronunciations of written languages evolve over
    time
  • If current written English were phonetically
    accurate:
  • It would only apply to a single dialect
  • It would be wrong as soon as the population
    altered its speech patterns

44
George Bernard Shaw's System
His Goal: Replace the Latin alphabet with one
that is phonetically accurate. Result: It didn't
work. Language phonetics are not static, and the
population was not willing to switch to a new
writing system.
45
Pitman Shorthand
46
ARPABET: English-based phonetic system
Phone  Example    Phone  Example    Phone  Example
iy     beat       b      bet        p      pet
ih     bit        ch     chet       r      rat
eh     bet        d      debt       s      set
ah     but        f      fat        sh     shoe
ae     bat        g      get        t      ten
ao     bought     hh     hat        th     thick
ow     boat       hy     high       dh     that
uh     book       jh     jet        dx     butter
ey     bait       k      kick       v      vet
er     bert       l      let        w      wet
ay     buy        m      met        wh     which
oy     boy        em     bottom
axr    dinner     n      net        y      yet
aw     down       en     button     z      zoo
ax     about      ng     sing       zh     measure
ix     roses      eng    washing
aa     cot        -      silence

47
The International Phonetic Alphabet
48
IPA Vowels
Caution: English tongue positions don't exactly
match the chart. For example, "father" in English
does not have the tongue position as far back as
the IPA vowel chart shows.
49
IPA Diacritics
50
IPA Tones and Word Accents
51
IPA Supra-segmental Symbols
52
Newer Technologies
  • VoiceXML
  • Framework for integrating human/machine dialogues
  • W3C (World Wide Web Consortium) standard
  • Input: Audio files or human speech
  • Output: Synthesized
  • Script interpreted by voice browsers
  • SSML (Speech Synthesis Markup Language)
  • XML-based technology to standardize manipulation
    of synthesized speech
  • Others
  • SABLE (1998 Consortium)
  • SAPI (Microsoft Speech API)
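
To give a feel for SSML, the sketch below builds a small fragment with Python's standard XML library. The `speak`, `emphasis`, and `break` elements are part of SSML. The helper name, its parameters, and the sample text are assumptions for illustration, and a complete document would also declare the SSML namespace and language on the root element.

```python
import xml.etree.ElementTree as ET


def build_ssml(text_before, emphasized, text_after, pause_ms=300):
    """Build a minimal SSML fragment: text, an emphasized word, a pause, more text."""
    speak = ET.Element("speak", {"version": "1.0"})
    speak.text = text_before
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = emphasized
    pause = ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    pause.tail = text_after
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Hello, ", "world", " and goodbye.")
print(ssml)
```

The markup controls how the text is spoken (emphasis and pauses here) without changing the words themselves, which is the sense in which SSML standardizes manipulation of synthesized speech.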