Mandarin-English Information (MEI): Investigating Translingual Speech Retrieval, Johns Hopkins University

1
Mandarin-English Information (MEI): Investigating Translingual Speech Retrieval
Johns Hopkins University, Center for Language and Speech Processing
Summer Workshop 2000: Progress Update
  • The MEI Team
  • August 2, 2000

2
Outline
  • Baseline (Pat, Gina, Wai-Kit)
  • Upper Bounds (Pat, Erika, Helen)
  • Climbing Upwards (Upcoming Research Problems)
  • translation (Gina, Jian Qiang)
  • word-subword fusion (Helen, Doug, Wai-Kit)
  • named entities, numerals (Helen, Sanjeev,
    Wai-Kit, Karen)
  • syllable lattice generation (Hsin-Min, Berlin)

3
The MEI Task
  • An example query (NYT, AP newswire):
    "A China Airlines A-310 jetliner returning from the Indonesian island of Bali
    with 197 passengers and crew crashed and burst into flames Monday night just
    short of Taipei's Chiang Kai-Shek Airport."
    (full story used as query, typically 200-500 words)
  • An example document (VOA)
  • accompanied by raw anchor scripts

4
Our Baseline System
Query side: Query -> Query Term Selection (1 to full document) -> Query Term Translation (dictionary-lookup) -> Translated, hexified Chinese query terms
Document side: Audio documents -> Dragon Mandarin Speech Recognizer -> Tokenized, hexified Chinese word sequence
Both sides feed: InQuery Retrieval Engine -> Evaluate retrieval outputs
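
The slides do not say how the Chinese text was "hexified". A minimal sketch, assuming each tokenized word is simply written out as the hexadecimal form of its encoded bytes so that an ASCII-oriented engine sees one token per word, might look like this. The function names and the choice of GB2312 are assumptions for illustration, not details from the presentation.

```python
def hexify(word, encoding="gb2312"):
    """Return the word as a lowercase hex string of its encoded bytes (assumed scheme)."""
    return word.encode(encoding).hex()


def unhexify(token, encoding="gb2312"):
    """Invert hexify, recovering the original Chinese word."""
    return bytes.fromhex(token).decode(encoding)


def hexify_sequence(words, encoding="gb2312"):
    """Hexify an already tokenized word sequence, one ASCII token per word."""
    return " ".join(hexify(w, encoding) for w in words)


if __name__ == "__main__":
    tokens = ["中国", "飞机"]  # hypothetical tokenizer output
    print(hexify_sequence(tokens))
```

Because the same mapping is applied to translated query terms and to recognized document words, exact string matching in the retrieval engine still corresponds to exact word matching.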
5
Our First Retrieval Experiment...
  • Queries
  • 17 exemplars
  • 1 per topic in TDT2 corpus
  • Documents
  • 2265 in all
  • 500 belong to at least 1 topic
  • others are off-topic or briefs
  • each topic has >2 relevant documents

6
Our First Retrieval Experiment
  • No. of query terms selected: 100 (sweep)
  • No. of alternative translations per term: 1
  • Word-based retrieval
  • Average Precision: 16.91
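
The slides report average precision without stating the exact variant. As a rough sketch, assuming the usual non-interpolated average precision per query, averaged over the query set, the figure could be computed as follows. The function names and toy document IDs are illustrative only.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    # Relevant documents never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    runs = [
        (["a", "b", "c", "d"], {"a", "c"}),
        (["x", "y"], {"y"}),
    ]
    print(round(100 * mean_average_precision(runs), 2))
```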

7
In Search of Upper Bounds...
  • Confounding factors on query side
  • term selection
  • translation (no. of terms, definition of a term,
    named entities, dictionary / COTS system)
  • Confounding factors on the document side
  • syllable recognition performance, OOV
  • word tokenization
  • Confounding factors in retrieval
  • word-based or subword-based (characters,
    syllables)
  • subword n-grams (n??)

8
Upper Bounds (Word)
  • Queries (ASR), Documents (ASR)
  • isolates the confounding factors (term selection,
    translation, recognition performance, word
    tokenization)
  • Ave Precision: 73.3
  • Queries (Xinhua), Documents (ASR/TKN)
  • isolates similar confounding factors
  • resembles MEI TDT task (queries and documents
    come from different news sources)
  • word tokenization (CETA / Dragon)
  • Best Ave Precision: 53.5 (ASR), 58.7 (TKN)

9
Chinese Words and Subwords
  • Characters (written) -> syllables (spoken)
  • Degenerate mapping
  • /hang2/, /hang4/, /heng2/ or /xing2/
  • /fu4 shu4/ (LDC's CALLHOME lexicon)
  • Tokenization / Segmentation
  • /zhe4 yi1 wan3 hui4 ru2 chang2 ju3 xing2/

10
Upper Bounds (Subword)
  • Queries (Xinhua), Documents (ASR/TKN)
  • character-based retrieval
  • overlapping character n-grams (document,
    within-term for queries, bigrams fare best)
  • Best Ave Precision: 54.3 (ASR), 55.9 (TKN)
  • overlapping bigrams in queries
  • Best Ave Precision: 61.7 (cross-term overlap)
  • syllable-based retrieval
  • word tokenization affects syllable lookup
  • syllable bigrams fare best
  • Best Ave Precision: 51.6 (ASR), 53.3 (TKN)
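
The difference between within-term and cross-term overlapping n-grams can be sketched as follows. This is a minimal illustration under the assumption that "within-term" means n-grams never span a term boundary and "cross-term" means the terms are concatenated first; the function names and example terms are hypothetical.

```python
def char_ngrams(text, n=2):
    """Overlapping character n-grams; a shorter string is kept whole."""
    if not text:
        return []
    if len(text) < n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def within_term_ngrams(terms, n=2):
    """N-grams generated inside each term only."""
    grams = []
    for term in terms:
        grams.extend(char_ngrams(term, n))
    return grams


def cross_term_ngrams(terms, n=2):
    """N-grams generated over the concatenated terms, so they may span boundaries."""
    return char_ngrams("".join(terms), n)


if __name__ == "__main__":
    terms = ["北京", "大学"]
    print(within_term_ngrams(terms))
    print(cross_term_ngrams(terms))
```

The same functions apply to syllable sequences if each term is first converted to a list of syllables and joined with a separator, which is left out here for brevity.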

11
Upper Bound (Translingual)
  • Putting back the translingual element
  • Selected English query terms -> translated
    Chinese query terms (Oracle: Jian Qiang Wang)
  • Retrieval performance
  • word-based (180 terms, no syn, sum): 50.6
  • subword-based retrieval (character bigrams): sum
    52.1, syn 52.3
  • TKN??

12
Thus Far...
[Chart: Ave Precision. Top level: TDT_English / ASR (???), perfect translation, best index term set. Annotation: trying to climb up.]
13
Better Translation
  • translation alternatives per term
  • Current best (120 query terms, 3 translations per
    term, word-based retrieval, ASR reseg with CETA,
    sum): 28.1
  • (90 query terms, 2 translations per term,
    word-based retrieval, ASR orig, sum): 27.53
  • Phrase-based translation
  • 2 types of phrases (named entities,
    dictionary-based phrases)
  • term selection (consider both phrases and
    component words), higher terms
  • Current best (250 query terms, all translations,
    word-based retrieval, 43.3)
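
The "sum" and "syn" labels on these slides appear to refer to how alternative translations are combined in the query. A hedged sketch of building such a query string is shown below, assuming an InQuery-style syntax in which `#sum( ... )` averages term beliefs and `#syn( ... )` treats its arguments as one term. The exact query syntax, the function name, and the placeholder tokens are assumptions, not taken from the presentation.

```python
def build_query(translations, max_alternatives=3, use_syn=False):
    """Build a structured query from {english_term: [chinese_alternatives]}.

    With use_syn=False all kept alternatives are listed flat under #sum;
    with use_syn=True the alternatives of one source term are grouped in #syn.
    """
    parts = []
    for _term, alternatives in translations.items():
        kept = alternatives[:max_alternatives]
        if not kept:
            continue  # untranslatable term: nothing to add
        if use_syn and len(kept) > 1:
            parts.append("#syn( " + " ".join(kept) + " )")
        else:
            parts.extend(kept)
    return "#sum( " + " ".join(parts) + " )"


if __name__ == "__main__":
    # Placeholder tokens stand in for hexified Chinese words.
    translations = {"airline": ["t1", "t2", "t3", "t4"], "crash": ["t5"]}
    print(build_query(translations, max_alternatives=2, use_syn=False))
    print(build_query(translations, max_alternatives=2, use_syn=True))
```

Sweeping `max_alternatives` and the number of selected query terms corresponds to the parameter settings quoted in the bullets above.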

14
Word-Subword Fusion
  • Words incorporate lexical knowledge
  • Subwords are intended to handle the OOV problem
  • Combination of both may beat either alone
  • Ranked list of retrieved documents
  • from word-based retrieval
  • from subword-based retrieval

15
Merging: Loose Coupling
Ranked list 1:
   1     voa4062  .22
   2     voa3052  .21
   3     voa4091  .17
   ...
   1000  voa4221  .04
Ranked list 2:
   1     voa4062  .52
   2     voa2156  .37
   3     voa3052  .31
   ...
   1000  voa2159  .02
  • Types of Evidence
  • Score
  • Rank
  • Score Combination
  • Max
  • Linear combination
  • Rank Combination
  • Round robin
  • Source bias
  • Query bias

   1     voa40612
   2     voa30522
   3     voa40911
   ...
   1000  voa42201
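
The score and rank combination options listed on this slide can be sketched with the top three entries of the two ranked lists shown above. This is an illustration only: the slide does not say which list comes from which retrieval run, whether scores were normalized before combination, or how ties were broken, so those choices and all function names below are assumptions.

```python
from itertools import zip_longest

run_a = [("voa4062", 0.22), ("voa3052", 0.21), ("voa4091", 0.17)]
run_b = [("voa4062", 0.52), ("voa2156", 0.37), ("voa3052", 0.31)]


def combine_scores(first, second, how="max", weight=0.5):
    """Merge two scored lists by max or by weighted linear combination."""
    a, b = dict(first), dict(second)
    merged = {}
    for doc in set(a) | set(b):
        sa, sb = a.get(doc, 0.0), b.get(doc, 0.0)  # missing document scores 0
        if how == "max":
            merged[doc] = max(sa, sb)
        elif how == "linear":
            merged[doc] = weight * sa + (1.0 - weight) * sb
        else:
            raise ValueError(f"unknown combination: {how}")
    # Sort by descending score, then by ID for a deterministic order.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))


def round_robin(first, second):
    """Interleave two ranked lists by rank, skipping documents already taken."""
    merged, seen = [], set()
    for pair in zip_longest(first, second):
        for item in pair:
            if item is not None and item[0] not in seen:
                seen.add(item[0])
                merged.append(item[0])
    return merged


if __name__ == "__main__":
    print([d for d, _ in combine_scores(run_a, run_b, "max")])
    print([d for d, _ in combine_scores(run_a, run_b, "linear", weight=0.5)])
    print(round_robin(run_a, run_b))
```

Source bias and query bias could be expressed in this sketch by changing `weight` per source or per query.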
16
Tight Coupling: Words and Bigrams
Lattice: jiang / qiang, zhe / ze, min / ming
Words: Jiang Zemin
Bigrams: jiang_zhe jiang_ze qiang_zhe qiang_ze
zhe_min zhe_ming ze_min ze_ming
Combination: jiang_zhe zhe_min Jiang Zemin
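
The bigram list on this slide can be reproduced from the lattice with a short sketch, assuming the lattice is simplified to a sequence of slots in which every alternative in one slot connects to every alternative in the next. The function name and the slot representation are illustrative assumptions.

```python
from itertools import product

# Each inner list holds the alternative syllable hypotheses for one position.
lattice = [["jiang", "qiang"], ["zhe", "ze"], ["min", "ming"]]


def lattice_bigrams(slots):
    """All syllable bigrams formed from adjacent slots of the lattice."""
    bigrams = []
    for left, right in zip(slots, slots[1:]):
        bigrams.extend(f"{a}_{b}" for a, b in product(left, right))
    return bigrams


if __name__ == "__main__":
    print(" ".join(lattice_bigrams(lattice)))
```

Indexing all of these bigrams alongside the recognized words is one way to read "tight coupling": word and subword terms share a single index instead of being merged after retrieval.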
17
Word-Subword Fusion (weighted similarity)
  • Merging ranked lists
  • Each retrieved document is scored
  • i denotes words, subword n-grams

18
Numerals and Named Entities
  • Verbalize numerals
  • Named Entities
  • BBN tags (names of locations, people,
    organization)
  • Derive Bilingual Term List from TDT2
  • English letter-to-phone generation
  • Cross-lingual phonetic mapping (English phones to
    Chinese phones)
  • Syllabification

19
Cross-Lingual Phonetic Mapping
Named entity: Jiang Zemin, Kosovo
Pinyin spelling -> Syllabify, e.g. jiang ze min
English Pronunciation Lookup or Letter-to-Phone Generation
-> English phones, e.g. k ao s ax v ow
-> Cross-lingual Phonetic Mapping
-> Chinese phones, e.g. k e s u o w o
-> Syllabification
-> Chinese syllables, e.g. ke suo wo
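
The Kosovo example on this slide can be traced with a toy sketch. The phone mapping table below is a hypothetical alignment read off from this single example, and the greedy longest-match syllabifier is a stand-in; the presentation does not describe how the mapping was learned or how syllabification was actually done.

```python
# Hypothetical English-phone to Chinese-phone table derived from the one example.
PHONE_MAP = {
    "k": ["k"], "ao": ["e"], "s": ["s"],
    "ax": ["u", "o"], "v": ["w"], "ow": ["o"],
}
# Tiny illustrative syllable inventory; a real one would list all Mandarin syllables.
SYLLABLES = {"ke", "suo", "wo"}


def map_phones(english_phones):
    """Replace each English phone by its mapped Chinese phone sequence."""
    chinese = []
    for phone in english_phones:
        chinese.extend(PHONE_MAP[phone])
    return chinese


def syllabify(phones, inventory, max_len=4):
    """Greedy longest-match grouping of phones into known syllables."""
    syllables, i = [], 0
    while i < len(phones):
        for j in range(min(len(phones), i + max_len), i, -1):
            candidate = "".join(phones[i:j])
            if candidate in inventory:
                syllables.append(candidate)
                i = j
                break
        else:
            raise ValueError(f"cannot syllabify from position {i}: {phones[i:]}")
    return syllables


if __name__ == "__main__":
    english = "k ao s ax v ow".split()
    chinese = map_phones(english)
    print(" ".join(chinese))
    print(" ".join(syllabify(chinese, SYLLABLES)))
```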
20
Syllable Lattice for Document Representation
  • Address ASR errors and OOV
  • Augment Dragon ASR output with alternate syllable
    hypotheses
  • Generate syllable n-grams for audio indexing
  • Include into word-subword fusion

21
Lots to do still...
22
Mandarin-English Information (MEI): Investigating Translingual Speech Retrieval
<http://www.glue.umd.edu/meiweb>
Johns Hopkins University, Center for Language and Speech Processing, JHU/NSF Summer Workshop 2000
MEI Team: Helen MENG (CUHK), Berlin CHEN (National Taiwan University), Erika GRAMS
(Advanced Analytic Tools), Sanjeev KHUDANPUR (JHU/CLSP), Gina-Anne LEVOW (University of
Maryland), Wai-Kit LO (CUHK), Douglas OARD (University of Maryland), Patrick SCHONE
(Department of Defense), Karen TANG (Princeton University), Hsin-Min WANG (Academia Sinica),
Jianqiang WANG (University of Maryland)
[System diagram, as of Sunday July 9, 2000 (Baseline)]
Query Processing: Input English text query; Named Entity Tagger; Phrase tagging; Query Term Selection; Term Translation (English to Chinese translation dictionary); Translated words and phrases; Unknown words and phrases; Word sequence; Character n-gram generation; Character n-gram sequence; Query to INQUERY
Document Processing: Spoken Mandarin documents; Dragon Mandarin ASR; Segmented Chinese Text; Word sequence; Character n-gram generation; Character n-gram sequence; Document to INQUERY
Retrieval and Scoring: INQUERY; Ranked List of Possibly Relevant Documents; Relevance Assessments; Scoring; Figure of Merit