Title: Design, compilation and processing of CUCall: a set of Cantonese spoken language corpora collected over telephone networks
1Design, compilation and processing of CUCall: a
set of Cantonese spoken language corpora
collected over telephone networks
- by
- W.K. Lo, P.C. Ching, Tan Lee and Helen Meng
- The Chinese University of Hong Kong
- at
- ROCLING XIV
- 16th August 2001
2Acknowledgment
- The CUCall data collection is conducted with support from the Innovation and Technology Fund (AF/96/99)
- We are also grateful to the industrial sponsors
- Group Sense Limited
- SmarTone Mobile Communication Limited
3Outline
- Corpus Design and Organization
- phonetically oriented
- application oriented
- Data Collection and Processing
- Data Analysis
- Conclusions
4Part I: Corpus Design and Organization
5Overview
- extension to the CUCorpora microphone speech database
- collection of telephone speech data over fixed-line and mobile networks
- allows phonetically oriented and domain-specific applications
- rich phonetic coverage with speaking style variations
- words, phrases and digit strings for specific use
6CUCall Organization
7Phonetically Oriented
- 5719 sentences
- selected from the pools of the CUSENT training and testing sets
- target phonetic coverage in a biphone context
- 90 short paragraphs
- enrich the phonetic coverage in addition to the sentence materials
- capture the variations brought about by the lengthy nature of the reading materials
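The slides do not say how sentences were chosen from the CUSENT pools, only that biphone coverage was the target. As one possible illustration, the sketch below uses a greedy set-cover heuristic, which is an assumption and not necessarily the authors' method. The function names and the toy pool of initial/final units are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Biphone = Tuple[str, str]


def biphones(units: List[str]) -> Set[Biphone]:
    """Return the set of adjacent unit pairs in one sentence."""
    return set(zip(units, units[1:]))


def greedy_select(pool: Dict[str, List[str]], max_sentences: int) -> List[str]:
    """Repeatedly pick the sentence that adds the most unseen biphones.

    Ties are broken by sentence id so the result is deterministic.
    Selection stops when no remaining sentence adds new coverage.
    """
    covered: Set[Biphone] = set()
    chosen: List[str] = []
    remaining = {sid: biphones(units) for sid, units in pool.items()}
    while remaining and len(chosen) < max_sentences:
        best = max(sorted(remaining), key=lambda sid: len(remaining[sid] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break
        covered |= gain
        chosen.append(best)
    return chosen


# Toy pool (hypothetical): each sentence is a list of initial/final units.
toy_pool = {
    "s1": ["n", "ei", "h", "ou"],
    "s2": ["n", "ei"],
    "s3": ["m", "aa", "h", "ou"],
}
selected = greedy_select(toy_pool, max_sentences=10)
```

In this toy pool, "s2" is never selected because its only biphone is already covered by "s1".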
8Phonetically Oriented
- 6 spontaneous conversations
- capture speakers' spontaneous responses
- content is unlimited and unconstrained
- contains all kinds of non-speech events, e.g. correction, hesitation, skipped word
- questions must be simple and open-ended
9Phonetically Oriented
- Criteria for question design
- simple enough for a spontaneous response: avoid calculation, memory recall etc.
- answers are expected to be different for different speakers
- responses may be either long or short
- avoid answers that touch on speakers' privacy
10Application Oriented
- 1440 words and phrases
- simple words covering various domains
- names of places
- listed companies
- foreign currencies
- navigation commands
- Digit strings
- strings of digits of various lengths
- all ten single digits
- randomly generated strings of length 7, 8 and 16
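The digit-string prompts described above can be sketched as follows. This is a minimal illustration, assuming uniform random digits and five strings per length (the per-speaker count given on the statistics slide). The function name and the seeding scheme are hypothetical.

```python
import random
from typing import Dict, List


def digit_prompts(seed: int, per_length: int = 5) -> Dict[int, List[str]]:
    """Build one speaker's digit prompts, keyed by string length.

    Length 1 holds all ten single digits; lengths 7, 8 and 16 hold
    `per_length` randomly generated strings each (assumed uniform).
    """
    rng = random.Random(seed)  # seeded so a prompt sheet can be regenerated
    prompts: Dict[int, List[str]] = {1: [str(d) for d in range(10)]}
    for length in (7, 8, 16):
        prompts[length] = [
            "".join(rng.choice("0123456789") for _ in range(length))
            for _ in range(per_length)
        ]
    return prompts


prompts = digit_prompts(seed=2001)
```

Seeding the generator per speaker is one way to make a prompt sheet reproducible when recordings are later validated against it.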
11Part II: Data Collection and Processing
12Collection Process
- Preparation of reading materials
- prepare reading materials as prompt sheets
- separate sheets for male/female speakers and fixed/mobile lines
- Distribution of prompt sheet
- distributed hierarchically through agents
- Speakers call
- speakers call automatic recording servers
- they are identified by unique serial numbers
- Questionnaire return
- information such as age and telephone network type is collected
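The preparation step above can be illustrated with a small sketch that assembles one prompt sheet. The per-speaker counts (50 of 5719 sentences, 3 of 90 paragraphs, 48 of 1440 words) come from the statistics slide. Everything else is an assumption: the slides do not describe how items are assigned to speakers, so simple random sampling is used here, and the serial-number format and field names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class PromptSheet:
    serial: str            # unique serial number identifying the speaker's call
    gender: str            # "male" or "female"
    line_type: str         # "fixed" or "mobile"
    sentences: List[int]   # indices into the 5719-sentence pool
    paragraphs: List[int]  # indices into the 90-paragraph pool
    words: List[int]       # indices into the 1440 words and phrases


def make_sheet(number: int, gender: str, line_type: str,
               rng: random.Random) -> PromptSheet:
    """Draw one speaker's reading materials without replacement."""
    # Hypothetical serial format: gender initial, line-type initial, 4 digits.
    serial = f"{gender[0].upper()}{line_type[0].upper()}{number:04d}"
    return PromptSheet(
        serial=serial,
        gender=gender,
        line_type=line_type,
        sentences=sorted(rng.sample(range(5719), 50)),
        paragraphs=sorted(rng.sample(range(90), 3)),
        words=sorted(rng.sample(range(1440), 48)),
    )


sheet = make_sheet(7, "female", "mobile", random.Random(0))
```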
13Data Collection System Set-up
- Post-processing of data for various targeted domains of applications
- Recording End: telephone outlet, telephony hardware, recording system, data storage system
- Calling End: from any location, using any telephone, by people from all walks of life
- Recording Servers: fixed-line connection to local telephone companies
- Telephone Companies: mobile/fixed-line network
- Note: CT board is Dialogic D/41-ESC
14Post-processing of Data
- Call validation
- received prompt sheets are verified against the recorded speech data
- user information is entered into databases
- Phonemic transcription
- all accepted speech data are 100% phonemically transcribed at the initial-final level
- Partitioning of collected data
- collected data are partitioned properly
- speech data and the transcriptions are organized on a per-speaker basis
15Data Processing After Collection
- Validation: identify successful recording sessions
- Data Storage: collected telephone speech data
- Transcription: accurate verbatim transcription for the speech data, e.g. /nei5-hou2-maa1/ /ngo5-hou2-hou2/ /nei5-ne1/ /dou1-ng4-co3-laa1/
- Distribution: printing CDROM for distribution
- Organization: organize data for easy access, e.g. \speaker01\data\001.wav, \002.wav, ... and \speaker01\annotate\001.xsc, \002.xsc, ...
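The per-speaker layout shown on this slide (\speaker01\data\001.wav paired with \speaker01\annotate\001.xsc) can be sketched as a small path helper. The zero-padding widths are inferred from the two example names only and are an assumption; the function name is hypothetical, and paths are built relative to an unspecified corpus root.

```python
from pathlib import PureWindowsPath
from typing import Tuple


def recording_paths(speaker: int, utterance: int) -> Tuple[PureWindowsPath, PureWindowsPath]:
    """Return the (waveform, annotation) paths for one utterance.

    Follows the pattern speakerNN\\data\\NNN.wav and
    speakerNN\\annotate\\NNN.xsc seen on the slide.
    """
    root = PureWindowsPath(f"speaker{speaker:02d}")
    name = f"{utterance:03d}"
    return root / "data" / f"{name}.wav", root / "annotate" / f"{name}.xsc"


wav_path, xsc_path = recording_paths(speaker=1, utterance=2)
```

Keeping waveform and annotation names identical apart from the directory and extension means either file can be located from the other without a lookup table.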
16Part III: Data Analysis
17Statistics of Reading Materials
Part       per speaker         tonal syl.   base syl.   syl. count
Phonetically oriented corpora
sent.      50 (out of 5719)    1399         579         4 to 31
para.      3 (out of 90)       768          418         23 to 120
Application-specific corpora
1-digit    10
7-digit    5
8-digit    5
16-digit   5
words      48 (out of 1440)    562          344         2 to 8
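The distinction between tonal and base syllables in the table can be illustrated on transcriptions written like the /nei5-hou2-maa1/ examples from the data-processing slide. This is a minimal sketch, assuming the two columns count distinct syllables with and without the trailing tone digit; the function names are hypothetical and the second example input is a toy string, not corpus data.

```python
import re
from typing import Iterable, List, Set, Tuple


def syllables(transcription: str) -> List[str]:
    """Split '/nei5-hou2-maa1/' into ['nei5', 'hou2', 'maa1']."""
    return [s for s in transcription.strip().strip("/").split("-") if s]


def base_syllable(tonal: str) -> str:
    """Drop the trailing tone digit: 'nei5' becomes 'nei'."""
    return re.sub(r"\d+$", "", tonal)


def coverage(transcriptions: Iterable[str]) -> Tuple[int, int]:
    """Return (distinct tonal syllables, distinct base syllables)."""
    tonal: Set[str] = set()
    base: Set[str] = set()
    for text in transcriptions:
        for syl in syllables(text):
            tonal.add(syl)
            base.add(base_syllable(syl))
    return len(tonal), len(base)


slide_examples = ["/nei5-hou2-maa1/", "/ngo5-hou2-hou2/"]
tonal_count, base_count = coverage(slide_examples)
toy_counts = coverage(["/maa1-maa5/"])  # same base syllable, two tones
```

The toy input shows why the tonal count in the table is larger than the base count: one base syllable can occur with several tones.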
18Frequency-of-frequency (FOF)
[Figure: frequency-of-frequency plots for the sentence and paragraph materials]
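A frequency-of-frequency distribution records, for each occurrence count n, how many distinct units occur exactly n times. The sketch below shows the computation on a toy token list. It assumes the counted units are tonal syllables, which the slide does not state, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def frequency_of_frequency(tokens: Iterable[str]) -> Dict[int, int]:
    """Map each occurrence count n to the number of distinct tokens seen n times."""
    occurrences = Counter(tokens)             # token -> how often it occurs
    fof = Counter(occurrences.values())       # count -> how many tokens have it
    return dict(sorted(fof.items()))


# Toy data only: syllables taken from the example transcriptions.
toy_tokens = ["nei5", "hou2", "maa1", "ngo5", "hou2", "hou2", "nei5", "ne1"]
fof = frequency_of_frequency(toy_tokens)
```

Here three syllables occur once, one occurs twice and one occurs three times, so the result maps 1 to 3, 2 to 1 and 3 to 1.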
19Part IV: Conclusions
20Current Status
- the collection process is divided into several stages
- expected completion date: March 2002
- so far, over 200 hours of data (from 1000 speakers) have been collected
- 120 hours for phonetically oriented data
- 80 hours for application-specific data
- over half of the collected data have been phonemically transcribed
21Conclusions
- the design and collection process for the Cantonese telephone speech corpora are presented
- corpora are designed to cover both phonetically oriented and application-specific data
- long reading materials and open questions for spontaneous data are also included
- details of post-processing and data analysis are given
22Thank You