Design, compilation and processing of CUCall: a set of Cantonese spoken language corpora collected over telephone networks

1
Design, compilation and processing of CUCall: a
set of Cantonese spoken language corpora
collected over telephone networks

by
W.K. Lo, P.C. Ching, Tan Lee and Helen Meng
The Chinese University of Hong Kong
at
ROCLING XIV
16th August 2001

2
Acknowledgment

The CUCall data collection is conducted with
support from the Innovation and Technology Fund
(AF/96/99).
We are also grateful to the industrial sponsors:
Group Sense Limited
SmarTone Mobile Communication Limited

3
Outline

Corpus Design and Organization
phonetically oriented
application oriented
Data Collection and Processing
Data Analysis
Conclusions

4
Part I: Corpus Design and Organization
5
Overview

extension of the CUCorpora microphone speech database
collection of telephone speech data over fixed-line and mobile networks
supports phonetically oriented and domain-specific applications
rich phonetic coverage with speaking-style variations
words, phrases and digit strings for specific uses

6
CUCall Organization
7
Phonetically Oriented

5719 sentences
selected from the pools of the CUSENT training and testing sets
targeted at phonetic coverage in a biphone context
90 short paragraphs
enrich the phonetic coverage in addition to the sentence materials
capture the variations brought about by the lengthy nature of the reading materials
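
The slides do not say how the sentences were chosen from the CUSENT pools. One common way to target biphone coverage is a greedy set-cover pass, sketched below purely as an illustration. The function names (`biphones`, `greedy_select`) and the toy phone sequences are hypothetical and are not taken from the presentation.

```python
from typing import Dict, List, Set, Tuple

Biphone = Tuple[str, str]


def biphones(phones: List[str]) -> Set[Biphone]:
    """Return the set of adjacent phone pairs in one sentence."""
    return set(zip(phones, phones[1:]))


def greedy_select(pool: Dict[str, List[str]], limit: int) -> List[str]:
    """Pick up to `limit` sentence ids, each time taking the sentence
    that adds the most biphones not yet covered (assumed strategy)."""
    covered: Set[Biphone] = set()
    chosen: List[str] = []
    remaining = {sid: biphones(p) for sid, p in pool.items()}
    while remaining and len(chosen) < limit:
        # Iterating over sorted ids makes ties deterministic.
        best = max(sorted(remaining), key=lambda sid: len(remaining[sid] - covered))
        if not remaining[best] - covered:
            break  # no remaining sentence adds a new biphone
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen


# Toy pool: these phone sequences are made up for illustration only.
pool = {
    "s1": ["n", "ei", "h", "ou"],
    "s2": ["n", "ei", "h", "ou", "m", "aa"],
    "s3": ["m", "aa", "l", "aa"],
}
selected = greedy_select(pool, limit=2)
print(selected)
```

A greedy pass like this does not guarantee the smallest possible sentence set, but it is simple and usually reaches good coverage quickly.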

8
Phonetically Oriented

6 spontaneous conversations
capture speakers' spontaneous responses
content is unlimited and unconstrained
contains all kinds of non-speech events, e.g. correction, hesitation, skipped words
questions must be simple and open-ended

9
Phonetically Oriented

Criteria for the question design
simple enough for a spontaneous response: avoid calculation, memory recall, etc.
answers are expected to be different for different speakers
responses may be either long or short
avoid answers that touch on speakers' privacy

10
Application Oriented

1440 words and phrases
simple words cover various domains
names of places
listed companies
foreign currencies
navigation commands
Digit strings
strings of digits of various lengths
all ten single digits
randomly generated strings of length 7, 8 and 16
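
As a rough sketch of how one speaker's digit prompts could be produced: the per-speaker counts (ten single digits, five strings each of length 7, 8 and 16) come from the statistics slide later in the deck, while the uniform sampling, the seed, and the name `digit_prompts` are assumptions made for illustration.

```python
import random
from typing import Dict, List

# Per-speaker counts as listed on the statistics slide:
# five strings each of length 7, 8 and 16 (plus all ten single digits).
STRINGS_PER_LENGTH = {7: 5, 8: 5, 16: 5}


def digit_prompts(rng: random.Random) -> Dict[int, List[str]]:
    """Build one speaker's digit prompts, keyed by string length.

    Digits are drawn uniformly and independently; the slides only say
    the strings are randomly generated, so this is an assumption.
    """
    prompts: Dict[int, List[str]] = {1: [str(d) for d in range(10)]}
    for length, count in STRINGS_PER_LENGTH.items():
        prompts[length] = [
            "".join(rng.choice("0123456789") for _ in range(length))
            for _ in range(count)
        ]
    return prompts


# The seed is arbitrary; fixing it makes a prompt sheet reproducible.
prompts = digit_prompts(random.Random(2001))
for length, strings in prompts.items():
    print(length, strings)
```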

11
Part II: Data Collection and Processing
12
Collection Process

Preparation of reading materials
prepare reading materials as prompt sheets
separated by male/female speakers and fixed/mobile lines
Distribution of prompt sheet
distributed hierarchically through agents
Speakers call
speakers call automatic recording servers
they are identified by unique serial numbers
Questionnaire return
information on age and telephone network type is collected

13
Data Collection System Set-up
Calling End: from any location, using any telephone, by people from all walks of life
Telephone Companies: mobile/fixed-line network
Recording Servers: fixed-line connection to local telephone companies
Recording End: telephone outlet, telephony hardware, recording system, data storage system, ...
Post-processing of data for various targeted domains of applications
Note: CT board is Dialogic D/41-ESC
14
Post-processing of Data

Call validation
received prompt sheets are verified against the recorded speech data
user information is entered into databases
Phonemic transcription
all accepted speech data are 100% phonemically transcribed at the initial-final level
Partitioning of collected data
collected data are partitioned properly
speech data and the transcriptions are organized on a per-speaker basis
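
The per-speaker organization can be illustrated with the path pattern shown on the following slide (`\speaker01\data\001.wav`, `\speaker01\annotate\001.xsc`). The sketch below is a minimal helper under that pattern. The name `speaker_paths`, the zero-padding widths, and the reading of `\002.wav` as a sibling file in the same directory are assumptions inferred from the example.

```python
from typing import Tuple


def speaker_paths(speaker: int, utterance: int) -> Tuple[str, str]:
    """Return (waveform path, annotation path) for one utterance.

    Padding widths (2 digits for the speaker, 3 for the utterance) are
    inferred from the example on the slide and may differ in the corpus.
    """
    root = f"\\speaker{speaker:02d}"
    name = f"{utterance:03d}"
    return f"{root}\\data\\{name}.wav", f"{root}\\annotate\\{name}.xsc"


wav, xsc = speaker_paths(1, 2)
print(wav)
print(xsc)
```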

15
Data Processing After Collection
Validation: identify successful recording sessions
Data Storage: collected telephone speech data
Transcription: accurate verbatim transcription for the speech data
Organization: organize data for easy access
Distribution: printing CD-ROMs for distribution
Example data layout:
\speaker01\data\001.wav
               \002.wav
\speaker01\annotate\001.xsc
                   \002.xsc
Example transcription:
/nei5-hou2-maa1/ /ngo5-hou2-hou2/ /nei5-ne1/ /dou1-ng4-co3-laa1/
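
The example transcription on this slide can be split into phrases and syllables with a few lines of code. The sketch below assumes that phrases are delimited by slashes, syllables by hyphens, and that each syllable ends in a single tone digit from 1 to 6. These conventions are inferred from the example string and are not stated in the slides. The function names are hypothetical.

```python
import re
from typing import List, Tuple


def parse_utterance(transcription: str) -> List[List[str]]:
    """Split '/a1-b2/ /c3/' into phrases, each a list of tonal syllables."""
    return [p.split("-") for p in re.findall(r"/([^/\s]+)/", transcription)]


def split_tone(syllable: str) -> Tuple[str, str]:
    """Separate the base syllable from its trailing tone digit (assumed 1-6)."""
    match = re.fullmatch(r"([a-z]+)([1-6])", syllable)
    if match is None:
        raise ValueError(f"unexpected syllable: {syllable!r}")
    return match.group(1), match.group(2)


text = "/nei5-hou2-maa1/ /ngo5-hou2-hou2/ /nei5-ne1/ /dou1-ng4-co3-laa1/"
phrases = parse_utterance(text)
tonal = {s for phrase in phrases for s in phrase}   # distinct tonal syllables
base = {split_tone(s)[0] for s in tonal}            # distinct base syllables
print(len(phrases), sum(len(p) for p in phrases), len(tonal), len(base))
```

Distinguishing tonal syllables from base syllables this way matches the two syllable counts reported in the statistics table. Splitting each syllable further into initial and final would need the phone inventory, which the slides do not list.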
16
Part III: Data Analysis
17
Statistics of Reading Materials
Part        per speaker        tonal syl.   base syl.   syl. count
Phonetically oriented corpora
sent.       50 (out of 5719)   1399         579         4 to 31
para.       3 (out of 90)      768          418         23 to 120
Application-specific corpora
1-digit     10
7-digit     5
8-digit     5
16-digit    5
words       48 (out of 1440)   562          344         2 to 8
18
Frequency-of-frequency (FOF)
[Figure: FOF plots for the Sentence and Paragraph materials]
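
The slide does not define FOF or say which unit was counted. Under the usual definition, the frequency-of-frequency value at f is the number of distinct units that occur exactly f times. The sketch below computes it for a toy syllable list. The choice of tonal syllables as the unit and the name `frequency_of_frequency` are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def frequency_of_frequency(units: Iterable[str]) -> Dict[int, int]:
    """Map each occurrence count f to the number of distinct units
    that appear exactly f times."""
    counts = Counter(units)                      # unit -> occurrences
    return dict(sorted(Counter(counts.values()).items()))


# Toy data only; not drawn from the corpus.
syllables = ["nei5", "hou2", "maa1", "ngo5", "hou2", "hou2", "nei5", "ne1"]
fof = frequency_of_frequency(syllables)
print(fof)
```

A distribution with many units at f = 1 indicates a long tail of rare units, which is what coverage-oriented reading material tries to reduce.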
19
Part IV: Conclusions
20
Current Status

the collection process is divided into several stages
expected completion date: March 2002
so far, over 200 hours of data (from 1000 speakers) have been collected
120 hours of phonetically oriented data
80 hours of application-specific data
over half of the collected data have been phonemically transcribed

21
Conclusions

design and collection process for the Cantonese
telephone speech corpora is presented
corpora are designed to cover both phonetically
oriented and application-specific data
they also include long reading materials and open
questions for spontaneous data
details of post-processing and data analysis are
given

22
Thank You
