Mandarin-English Information (MEI): Investigating Translingual Speech Retrieval, Johns Hopkins University


1
Mandarin-English Information (MEI): Investigating
Translingual Speech Retrieval
Johns Hopkins University Center for Language and
Speech Processing, Summer Workshop 2000
  • The MEI Team
  • August 23, 2000

2
MEI Team
  • Senior Members
    • Helen Meng, Chinese University of Hong Kong
    • Erika Grams, Advanced Analytic Tools
    • Sanjeev Khudanpur, Johns Hopkins University
    • Gina-Anne Levow, University of Maryland
    • Douglas Oard, University of Maryland
    • Patrick Schone, US Department of Defense
    • Hsin-Min Wang, Academia Sinica, Taiwan
  • Students
    • Berlin Chen, National Taiwan University
    • Wai-Kit Lo, Chinese University of Hong Kong
    • Karen Tang, Princeton University
    • Jianqiang Wang, University of Maryland

3
Outline
  • Motivation
  • Background
  • The Multi-scale Paradigm
  • multi-scale query processing
  • multi-scale document indexing
  • multi-scale retrieval
  • The Perfect Retrieval Myth
  • Experiments and Findings
  • Conclusions and Future Work

4
Motivation
  • Monolingual speech retrieval applications are
    emerging, e.g.
  • http://speechbot.research.compaq.com

[Figure: Internet-accessible radio and television
stations. Source: www.real.com, Feb 2000]
5
Motivation (cont.): Internet User Population
[Figure: Internet user population by language
(English, Chinese), 2000 vs. 2005. Source: Global
Reach]
6
MEI: The Big Picture
7
Concept Demo
Karen, Erika
8
Two Prevailing Problems in CL-SDR
  • Translation problem
  • out-of-vocabulary (OOV) in translation
  • too many translations
  • Recognition problem
  • OOV in recognition
  • acoustic confusions
  • Solution: subword units may help
  • transliteration, e.g.
  • Northern Ireland /bei3 ai4 er3 lan2/ (in query)
  • recognition of subword units, e.g.
  • Iraq -- a rock (in document)

9
Background for Mandarin Speech Recognition
  • 400 syllables
  • full phonological coverage in Mandarin Chinese
  • 6,800 characters
  • full textual coverage in written Chinese
    (GB-coded)
  • each character pronounced as a syllable
  • Unknown number of Chinese words
  • one to several characters per word
  • character combinations create different meanings
  • ambiguity in word tokenization

10
OOV and Acoustic Confusions in Mandarin SDR
Query: Iraq...
11
Subwords for Retrieval
Pros
  • Character n-grams
  • robust to word-level mismatches due to different
    tokenization
  • Syllable n-grams
  • robust to word/character-level mismatches due to
    homophones
  • Partial matches possible

Con
  • Subwords contain reduced lexical knowledge cf.
    words

12
The MEI Investigation
  • Use of a multi-scale representation for
    crosslingual spoken document retrieval (CL-SDR)
  • Words and subwords

Research Challenges
  • Multi-scale query translation
  • Multi-scale audio indexing
  • Multi-scale retrieval

13
Query by Example
English Newswire Exemplars
President Bill Clinton and Chinese President
Jiang Zemin engaged in a spirited, televised
debate Saturday over human rights and
the Tiananmen Square crackdown, and announced a
string of agreements on arms control, energy and
environmental matters. There were no announced
breakthroughs on American human rights concerns,
including Tibet, but both leaders accentuated
the positive
Mandarin Audio Stories
14
Evaluation Collection
Collection                      Period            Topics  Stories
Development Collection: TDT-2   Jan 98 - Jun 98   17      2265
Evaluation Collection: TDT-3    Oct 98 - Dec 98   56      3371

  • Each topic has a variable number of exemplars
  • English text topic exemplars: Associated
    Press, New York Times
  • Mandarin audio broadcast news: Voice of America,
    manually segmented stories (timeline marker:
    Mar 98)
  • Exhaustive relevance assessment based on event
    overlap
15
Abstract Task Model
American English Text Exemplar
Mandarin Chinese Broadcast News
Cross-Language Speech Retrieval
Ranked List of News Stories
16
Evaluation of Ranked Lists
Ranked List     Relevance Judgments
VOA 0427.22     Relevant
VOA 0521.14     Not relevant
VOA 0604.39     Not relevant
VOA 0419.12     Relevant
VOA 0513.17     Relevant
VOA 0527.13     Not relevant
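The later slides report mean uninterpolated average precision. As a minimal sketch of how that measure could be computed for the six-story list above: the function name `average_precision` is illustrative, and the sketch assumes that every relevant story for the topic appears somewhere in the list.

```python
def average_precision(relevance, total_relevant=None):
    """Uninterpolated average precision for one ranked list.

    relevance: booleans in rank order (True = relevant).
    total_relevant: number of relevant stories for the topic; defaults to
    the number found in the list (an assumption made for this sketch).
    """
    if total_relevant is None:
        total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each relevant story.
            precision_sum += hits / rank
    return precision_sum / total_relevant


# Judgments from the slide, in rank order.
ranked = [
    ("VOA 0427.22", True),
    ("VOA 0521.14", False),
    ("VOA 0604.39", False),
    ("VOA 0419.12", True),
    ("VOA 0513.17", True),
    ("VOA 0527.13", False),
]
ap = average_precision([rel for _, rel in ranked])
print(f"average precision = {ap:.3f}")
```

Here the relevant stories sit at ranks 1, 4 and 5, so the value is (1/1 + 2/4 + 3/5) / 3 = 0.7. Averaging this figure over exemplars and topics gives the mean reported on the following slides.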


17
Recall-Precision Graph
18
Variation Across Exemplars
19
Average Across Exemplars
20
Variation Across Topics
[Figure: mean uninterpolated average precision
(0.0 to 1.0) by topic]
21
Comparing Two Systems
[Figure: two systems compared topic by topic]
22
Significance Testing
  • Statistical significance
  • Null hypothesis: mean average precision across
    topics is drawn from the same distribution
  • Paired 2-tailed t-test, significant if p
  • For System A vs. System B, p = 0.94
  • Meaningful differences
  • Rule of thumb: 5-10% relative
  • For System A vs. System B, relative difference is
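A minimal sketch of the two checks on this slide, using only the Python standard library. The per-topic scores are made-up illustration values, not workshop results, and the function names are hypothetical. The sketch stops at the t statistic; turning it into a two-tailed p-value needs the Student t distribution (for example `scipy.stats.ttest_rel`, if SciPy is available).

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test over topics."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 topics")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of per-topic differences
    if sd == 0:
        raise ValueError("all per-topic differences are identical")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


def relative_difference(scores_a, scores_b):
    """Relative difference of mean average precision, A versus B."""
    return (mean(scores_a) - mean(scores_b)) / mean(scores_b)


# Hypothetical per-topic average precision for two systems.
ap_a = [0.50, 0.60, 0.70, 0.40]
ap_b = [0.45, 0.65, 0.60, 0.40]

t_stat, dof = paired_t_statistic(ap_a, ap_b)
rel = relative_difference(ap_a, ap_b)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
print(f"relative difference = {rel:.1%}")
```

With these illustration values the relative difference is just under 5%, so it would fall below the slide's 5-10% rule of thumb.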

23
Translingual and Multi-Scale Query Processing
24
[Figure: system architecture.
Query side: English Exemplar ("President Bill
Clinton and ..."), Named Entity Tagging, Term
Selection, Term Translation with the Bilingual Term
List, Query Construction.
Document side: Mandarin Audio, Story Boundaries,
Speech Recognition, Document Construction.
Retrieval and scoring: Mandarin IR System, Ranked
List, Evaluation against Relevance Judgments using
Mean Uninterpolated Average Precision.
Resource labels shown in the figure: LDC, BBN,
U Mass, Cornell, Dragon.]
25
Multi-Scale Query Translation
  • Words and Phrases
  • (Gina, Sanjeev)
  • Subwords
  • (Helen, Wai-Kit, Berlin, Karen)

26
Bilingual Term List
  • Combination of
  • LDC English-Chinese bilingual term list
  • Chinese-English Translation Assistance File
    (CETA) inverted

Total English Terms          199,444
Total Translation Pairs      395,216
Phrasal Terms                 81,127
Phrasal Translation Pairs    105,750

Term            # translations
human           7
right(s)        30
human rights    1
27
Query Term Selection
  • Tagged named entities (BBN IdentiFinder)
  • Person: partners of Goldman, Sachs & Co.
  • Organization: UN Security Council
  • Dictionary-based phrases
  • translatable multi-word units, e.g.
  • Wall Street, best interests, guiding
    principles, human rights
  • automatic tagging: greedy, left-to-right, max
    match
  • Chi-squared filtering
  • Compared to English background model
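The greedy, left-to-right, maximum-match tagging of dictionary phrases can be sketched as follows. This is an illustration rather than the workshop's tagger: the function name `tag_phrases`, the `max_len` cap and the case-insensitive comparison are all assumptions, and the phrase list is taken from the examples on the slide.

```python
def tag_phrases(tokens, phrase_list, max_len=4):
    """Group tokens into dictionary phrases, longest match first.

    Scans left to right; at each position it tries the longest candidate
    span (up to max_len tokens) and falls back to the single token.
    """
    phrases = {tuple(p.lower().split()) for p in phrase_list}
    tagged = []
    i = 0
    while i < len(tokens):
        match_len = 1
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in phrases:
                match_len = n
                break
        tagged.append(" ".join(tokens[i:i + match_len]))
        i += match_len
    return tagged


phrase_list = ["Wall Street", "best interests", "guiding principles",
               "human rights"]
tokens = "debate over human rights on Wall Street".split()
units = tag_phrases(tokens, phrase_list)
print(units)
```

Each multi-word unit found this way can then be looked up in the bilingual term list as a single entry instead of word by word.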

28
Query Term Translation
  • Named entities
  • if absent from dictionary, translate individual
    terms
  • e.g. Security Council versus First Bank of
    Siam
  • Numeric Expressions
  • special processing for digits
  • e.g. 12:30 pm, June 15, 1969
  • Remaining terms
  • Consult bilingual term list, lemmatize if
    necessary
  • e.g. televised translates as television

29
Query Construction
  • Unbalanced queries
  • Use all plausible translations for each term
  • Balanced queries
  • Pseudo-term weight: average of the translations'
    weights
  • Structured queries
  • Recompute pseudo-term weight from the translations'
    term frequency and document frequency
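The three strategies can be contrasted on a toy index. This is a sketch under stated assumptions: the tf-idf weight `tf * log(N / df)`, the postings layout, and the placeholder translation names `zh_a` and `zh_b` are all invented for illustration, and the slide does not say which weighting the IR system actually used. For the structured case the sketch pools term frequencies and takes the document frequency of the union of the translations' postings.

```python
import math


def tfidf(tf, df, n_docs):
    """Assumed term weight: tf * log(N / df)."""
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)


def unbalanced(translations, doc, postings, n_docs):
    """Every translation contributes its own weight."""
    total = 0.0
    for t in translations:
        docs = postings.get(t, {})
        total += tfidf(docs.get(doc, 0), len(docs), n_docs)
    return total


def balanced(translations, doc, postings, n_docs):
    """Pseudo-term weight = average of the translations' weights."""
    return unbalanced(translations, doc, postings, n_docs) / len(translations)


def structured(translations, doc, postings, n_docs):
    """Treat the translations as one pseudo-term: pooled tf, union df."""
    tf = sum(postings.get(t, {}).get(doc, 0) for t in translations)
    docs = set()
    for t in translations:
        docs.update(postings.get(t, {}))
    return tfidf(tf, len(docs), n_docs)


# Toy index: translation -> {document id: term frequency}.
postings = {"zh_a": {"d1": 2, "d2": 1}, "zh_b": {"d1": 1}}
translations = ["zh_a", "zh_b"]
n_docs = 4

u = unbalanced(translations, "d1", postings, n_docs)
b = balanced(translations, "d1", postings, n_docs)
s = structured(translations, "d1", postings, n_docs)
print(f"unbalanced={u:.4f} balanced={b:.4f} structured={s:.4f}")
```

In the unbalanced case a query term with many translations outweighs one with few; the balanced and structured variants both collapse the translations into a single pseudo-term weight.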

30
Strategies in Query Translation
  • Phrase-based translation is significantly better
  • Named entity and numeral translations are
    (barely) helpful
  • Balanced translation matches structured queries
  • also extends easily to subword units

31
Untranslatable Terms
Term         # of occurrences
suharto      97
netanyahu    88
starr        62
arafat       50
bjp          45
vajpayee     44
estrada      44
...
hsu          19
zemin        7
32
Subword Transliteration
English Query Exemplar: ...Kosovo...
Kosovo (/ke1-suo3-wo4/, /ke1-suo3-fo2/,
/ke1-suo3-fu1/, /ke1-suo3-fu2/)
Mandarin Audio Document: .../ke-suo-fo/...
Sound alike -- match in phonetic space?
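A minimal sketch of matching in phonetic space for the Kosovo example. It assumes the comparison ignores tone, which is suggested by the toneless /ke-suo-fo/ on the document side but not stated on the slide. The helper `strip_tones` is hypothetical.

```python
import re


def strip_tones(pinyin):
    """'/ke1-suo3-fo2/' -> ('ke', 'suo', 'fo')"""
    syllables = pinyin.strip("/").split("-")
    return tuple(re.sub(r"\d", "", s) for s in syllables)


# Transliteration variants of "Kosovo" from the slide (query side).
query_variants = [
    "/ke1-suo3-wo4/",
    "/ke1-suo3-fo2/",
    "/ke1-suo3-fu1/",
    "/ke1-suo3-fu2/",
]

# Syllables recognized in the Mandarin audio document.
document_syllables = strip_tones("/ke-suo-fo/")

matches = [v for v in query_variants
           if strip_tones(v) == document_syllables]
print(matches)
```

Keeping several transliteration variants on the query side raises the chance that one of them matches what the recognizer actually produced.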
33
Subword Transliteration Procedure (1)
34
Subword Transliteration Procedure (2)
35
Subword Transliteration
  • Chinese phone lattice generation
  • Syllable bigram language model
  • N-best syllable sequence hypotheses
  • N = 1 (one-best hypothesis)
  • /ji li si te fu/ (hyp)
  • /ke li si tuo fu/ (ref)
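To quantify how far the one-best hypothesis is from the reference, one option is a syllable-level edit distance. The sketch below is an illustration only: the slide does not say how hypotheses were scored, and the accuracy formula `1 - errors / len(ref)` is an assumption.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two syllable sequences."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # extra syllable in hyp
                            curr[j - 1] + 1,      # syllable missing from hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


hyp = "ji li si te fu".split()
ref = "ke li si tuo fu".split()

errors = edit_distance(hyp, ref)
accuracy = 1 - errors / len(ref)
print(f"{errors} syllable errors, accuracy {accuracy:.0%}")
```

For this pair the two errors are the substitutions ji/ke and te/tuo; the remaining three syllables still match, which is what makes partial matching on syllable n-grams possible.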
36
Cross-Lingual Phonetic Matching
  • Documents are indexed with syllable bigrams (in
    addition to words and character bigrams if
    necessary)
  • Query terms are translated as words where
    possible, phonetically where necessary

37
Multi-Scale Query Construction
  • Helen

38
Multi-Scale Query Construction: Objectives
[Figure: Bag of English query terms (selected) ->
Query Construction -> Multi-scale query
representation in Chinese]
  • Multi-scale representation integrates
  • translated phrases, named entities, numeric
    expressions, translated terms
  • transliterated syllables
  • words, characters and syllable n-grams

39
Multi-Scale Query Construction: Procedures
  • words and syl bigrams
  • char and syl bigrams
  • syl bigrams

Syllable bigrams and transliterations: yi-se
se-lie shou-xiang ben-jie jie-ming ne-tan tan-ya
ya-hu
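The bigram string on this slide can be reproduced with overlapping syllable bigrams formed inside each word or transliterated name. The grouping of syllables into four units below is inferred from the bigrams themselves (there is no lie-shou or ming-ne bigram), so it should be read as an assumption.

```python
def syllable_bigrams(words):
    """Overlapping syllable bigrams formed within each word."""
    bigrams = []
    for syllables in words:
        for left, right in zip(syllables, syllables[1:]):
            bigrams.append(f"{left}-{right}")
    return bigrams


# Assumed grouping of the query syllables into words and names.
words = [
    ["yi", "se", "lie"],
    ["shou", "xiang"],
    ["ben", "jie", "ming"],
    ["ne", "tan", "ya", "hu"],
]

query_terms = syllable_bigrams(words)
print(" ".join(query_terms))
```

A single-syllable word yields no bigram under this scheme, so it would have to be covered by the word or character representation instead.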
40
Multi-Scale Audio Document Indexing
  • Hsin-min, Helen, Berlin, and Wai-kit

41
Previous Chinese Example
42
Audio Document Indexing: Objectives
  • Augment words with subword-based indexing
  • Dragon word recognition outputs are provided
  • Character-based indexing
  • Characters derived from Dragon's recognized words
  • Syllable-based indexing
  • Syllables derived by pronunciation lookup using
    Dragon's recognized words
  • Address Dragon's ASR errors
  • Augment with alternative (word/char/syl)
    hypotheses, e.g. syllable lattice (Chen & Wang,
    ICASSP-2000)

43
Syllable Lattice Development
  • Dragon's recognition accuracies
  • Evaluated against anchor scripts
  • 82.0% (word), 87.9% (char), 92.1% (syl)
  • Syllable substitution errors (5.2%)
  • MEI's syllable recognition accuracy
  • Trained on Hub4 Mandarin (VOA, 11 hours, 1997)
  • 70.2% (syl) !!!

44
Strategy
  • Improve MEI's syllable recognizer
  • Design a structure for document indexing which
    incorporates
  • Dragon's word / character / syllable hypotheses
  • MEI's syllable hypotheses
  • (hopefully complementary to Dragon's syllables)

45
MEI Syllable Recognizer: Improve Acoustic Models
[Figure: VOA Audio for Doc i and Dragon Outputs for
Doc i -> Forced Alignment -> Speaker Adaptation of
the Baseline Acoustic Models -> Speaker-Adapted
Acoustic Models -> Syllable Recognition -> MEI
Syllables for Doc i]
  • Forced alignment with Dragon's output for each
    document
  • Blind speaker adaptation with Dragon's syllables
  • MEI syllable accuracy: 70.2% (original) -> 87.7%
    !!!

46
MEI Syllable Recognizer: Incorporate Language
Model
47
Audio Document Indexing with Multiple Syllable
Recognition Outputs
The revised syllable lattice
48
Multi-scale Audio Document Indexing
49
Fusion of Words and Subwords in Multi-Scale
Retrieval
  • Wai-Kit Lo, Pat Schone

50
Loose Coupling
  • Merging ranked lists from separate runs
  • For each query and document pair, the score is
    recalculated as
    $S(Q_i, D_j) = \sum_k w_k \, S_k(Q_i, D_j)$
  • $w_k$ are the weights for different retrieval runs
  • $k$ denotes a retrieval run at some scale (word,
    characters, syllables, combinations)
  • $S_k(Q_i, D_j)$ is a rank-based score between query i
    and document j in retrieval run k
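A minimal sketch of this merging step, assuming the recalculated score is a weighted sum of the per-run rank-based scores, which is how the bullet definitions read. The reciprocal-rank form of the per-run score, the run names, the weights and the document ids are all assumptions chosen for illustration.

```python
def rank_scores(ranked_docs):
    """Rank-based score per document: 1 / rank (an assumed form)."""
    return {doc: 1.0 / rank for rank, doc in enumerate(ranked_docs, start=1)}


def loose_coupling(runs, weights):
    """Merge ranked lists from separate runs into one ranked list."""
    fused = {}
    for name, ranked_docs in runs.items():
        w = weights[name]
        for doc, score in rank_scores(ranked_docs).items():
            fused[doc] = fused.get(doc, 0.0) + w * score
    # Highest fused score first; ties broken by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical ranked lists for one query at three scales.
runs = {
    "word":  ["d1", "d2", "d3"],
    "char2": ["d2", "d1", "d4"],
    "syl2":  ["d2", "d3", "d1"],
}
weights = {"word": 0.4, "char2": 0.4, "syl2": 0.2}

merged = loose_coupling(runs, weights)
print(merged)
```

A document missing from a run simply contributes nothing for that run, so the merged list can contain documents that only one scale retrieved.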

51
Loose Coupling
[Figure: loose coupling of Word, Char2 and Syl2
runs]
52
Tight Coupling
  • Unified indexing of words and subword n-grams
  • For query and documents
  • Combine terms at different scales to form a
    multi-scale query/document representation, e.g.
    yi-se se-lie shou-xiang ben-jie jie-ming ne-tan
    tan-ya ya-hu
  • Multi-scale retrieval produces a single ranked
    list
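A sketch of the tight-coupling idea: word-level and syllable-bigram terms go into one bag of terms, and a single score is computed over it. The `w:` and `s:` scale prefixes, the cosine scoring and the misrecognized document (ya-fu instead of ya-hu) are assumptions for illustration; the slides do not specify how the unified index was scored.

```python
import math
from collections import Counter


def multi_scale_terms(words):
    """One bag of terms holding word-level and syllable-bigram terms."""
    terms = Counter()
    for syllables in words:
        terms["w:" + "".join(syllables)] += 1
        for left, right in zip(syllables, syllables[1:]):
            terms[f"s:{left}-{right}"] += 1
    return terms


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Query name and a hypothetical misrecognized version in a document.
query = multi_scale_terms([["ne", "tan", "ya", "hu"]])
document = multi_scale_terms([["ne", "tan", "ya", "fu"]])

word_overlap = [t for t in query if t.startswith("w:") and t in document]
score = cosine(query, document)
print(f"word-level matches: {len(word_overlap)}, multi-scale score: {score:.2f}")
```

The word-level terms do not match here, yet the shared bigrams ne-tan and tan-ya still give the document a non-zero score in the single ranked list.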
53
Loose vs Tight Coupling
  • Tight coupling combines document scores before
    ranking
  • may need weight optimization
  • Loose coupling combines lists post-hoc
  • outperforms individual lists

54
The Perfect Retrieval Myth
  • Erika, Helen, Hsin-Min, Jian Qiang, Berlin

55
The Perfect Retrieval Myth
  • 100% Average Precision: ALL relevant docs and
    ZERO non-relevant docs retrieved

Is corrupted by ...
56
Bounds on Word-Based Systems
  • Using Mandarin VOA documents as exemplars
  • matched condition
  • Using Xinhua text documents as exemplars
  • source mismatch
  • Using manual translations of NYT documents as
    exemplars

57
Bounds on Subword-Based Systems
  • Character bigrams for indexing
  • marginally outperforms word-based systems
  • Syllable bigrams
  • are quite competitive, though somewhat behind
  • Mean average precision 0.6 is a good CL-SDR
    target

58
TDT-2 Results
59
Retrieval Performance on TDT2
60
TDT-3 Results
61
Retrieval Performance on TDT3
62
Summary and Conclusions
  • Novel multi-scale paradigm for CL-SDR
  • ameliorates the translation and recognition OOV
    problems
  • Multi-scale query and document processing
  • cross-lingual subword transliteration procedure
    (CLPM)
  • query and document construction embeds words /
    characters / syllables
  • balanced and structured queries
  • Multi-scale retrieval
  • tight and loose coupling strategies to fuse words
    and subwords for retrieval

63
Summary and Conclusions (2)
  • Extensive experiments on TDT-2, TDT-3
  • character bigrams typically outperform words or
    syllable bigrams in retrieval
  • fusion of word and subword units shows potential
    in multi-scale retrieval
  • syllable lattice needs further investigation

64
Future Work
  • Word-subword fusion techniques merit further
    investigation
  • Multi-scale query expansion for retrieval
    performance improvement (Wai-Kit)
  • Incorporation of acoustic scores in syllable
    lattice representation for documents

65
END