Design, compilation and processing of CUCall: a set of Cantonese spoken language corpora collected over telephone networks

Design, compilation and processing of CUCall: a set of Cantonese spoken language corpora collected over telephone networks, by W.K. Lo, P.C. Ching, Tan Lee and Helen Meng

1
Design, compilation and processing of CUCall: a
set of Cantonese spoken language corpora
collected over telephone networks
  • by W.K. Lo, P.C. Ching, Tan Lee and Helen Meng,
    The Chinese University of Hong Kong
  • at ROCLING XIV, 16th August 2001

2
Acknowledgment
  • The CUCall data collection was conducted with
    support from the Innovation and Technology Fund
    (AF/96/99)
  • We are also grateful to the industrial sponsors:
    • Group Sense Limited
    • SmarTone Mobile Communication Limited

3
Outline
  • Corpus Design and Organization
    • phonetically oriented
    • application oriented
  • Data Collection and Processing
  • Data Analysis
  • Conclusions

4
Part I: Corpus Design and Organization

5
Overview
  • extension to the CUCorpora microphone speech
    database
  • collection of telephone speech data over
    fixed-line and mobile networks
  • supports phonetically oriented and domain-specific
    applications
    • rich phonetic coverage with speaking-style
      variations
    • words, phrases and digit strings for specific use

6
CUCall Organization
7
Phonetically Oriented
  • 5719 sentences
    • selected from the pools of the CUSENT training
      and testing sets
    • targeted at phonetic coverage in a biphone context
  • 90 short paragraphs
    • enrich the phonetic coverage in addition to the
      sentence materials
    • capture the variations brought about by the
      lengthy nature of the reading materials
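
One way to read "phonetic coverage in a biphone context" is as a coverage-driven selection over a sentence pool. The slides do not say how the 5719 sentences were actually chosen, so the greedy procedure below is only an illustrative assumption, and the function names, the toy pool and the use of initial/final units are all hypothetical.

```python
def biphones(units):
    """Return the set of adjacent unit pairs (biphones) in one sentence."""
    return set(zip(units, units[1:]))


def greedy_select(sentences, limit):
    """Pick up to `limit` sentences, always taking the one that adds
    the most biphones not yet covered (an assumed heuristic)."""
    covered = set()
    chosen = []
    remaining = dict(sentences)  # sentence id -> list of phonetic units
    while remaining and len(chosen) < limit:
        best = max(remaining,
                   key=lambda sid: len(biphones(remaining[sid]) - covered))
        gain = biphones(remaining[best]) - covered
        if not gain:
            break  # nothing left that improves coverage
        covered |= gain
        chosen.append(best)
        del remaining[best]
    return chosen, covered


# Toy pool: each sentence is a sequence of initial/final units.
pool = {
    "s1": ["n", "ei", "h", "ou", "m", "aa"],
    "s2": ["n", "ei", "h", "ou"],
    "s3": ["ng", "o", "h", "ou", "h", "ou"],
}
chosen, covered = greedy_select(pool, limit=2)
print(chosen, len(covered))
```

In this toy pool the second sentence adds no new biphones once the first is chosen, which is why a coverage-driven pass would skip it.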

8
Phonetically Oriented
  • 6 spontaneous conversations
    • capture speakers' spontaneous responses
    • content is unlimited and unconstrained
    • contains all kinds of non-speech events, e.g.
      corrections, hesitations, skipped words
    • questions must be simple and open-ended

9
Phonetically Oriented
  • Criteria for question design
    • simple enough for a spontaneous response: avoid
      calculation, memory recall, etc.
    • answers are expected to differ between
      speakers
    • responses may be either long or short
    • avoid answers that touch on speakers'
      privacy

10
Application Oriented
  • 1440 words and phrases
    • simple words covering various domains
      • names of places
      • listed companies
      • foreign currencies
      • navigation commands
  • Digit strings
    • strings of digits of various lengths
    • all ten single digits
    • randomly generated strings of length 7, 8 and 16
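
The digit-string prompts described above can be sketched as follows. The generator, its name and the seed are hypothetical; the slides only say the strings are randomly generated with lengths 7, 8 and 16, and the statistics slide lists five strings of each length per speaker, which is used here as the default.

```python
import random


def make_digit_prompts(rng, lengths=(7, 8, 16), per_length=5):
    """Build one speaker's digit prompts: the ten single digits,
    then `per_length` random strings for each requested length."""
    prompts = [str(d) for d in range(10)]  # all ten single digits
    for n in lengths:
        for _ in range(per_length):
            prompts.append("".join(rng.choice("0123456789") for _ in range(n)))
    return prompts


rng = random.Random(2001)  # arbitrary seed, only for repeatability
prompts = make_digit_prompts(rng)
print(len(prompts))
```

Drawing each digit independently keeps the sketch simple; it makes no attempt to balance digit frequencies across speakers, which a real prompt design might want.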

11
Part II: Data Collection and Processing

12
Collection Process
  • Preparation of reading materials
    • prepare reading materials as prompt sheets
    • separated by male/female and fixed/mobile lines
  • Distribution of prompt sheets
    • distributed hierarchically through agents
  • Speakers call
    • speakers call automatic recording servers
    • they are identified by unique serial numbers
  • Questionnaire return
    • information such as age and telephone network
      type is collected

13
Data Collection System Set-up
  • Calling End: from any location, using any
    telephone, by people from all walks of life
  • Telephone Companies: mobile/fixed-line network
  • Recording Servers: fixed-line connection to
    local telephone companies
  • Recording End: telephone outlet, telephony
    hardware, recording system, data storage system ..
  • Post-processing of data for various
    targeted domains of applications
  • Note: CT board is Dialogic D/41-ESC

14
Post-processing of Data
  • Call validation
    • received prompt sheets are verified against the
      recorded speech data
    • user information is entered into databases
  • Phonemic transcription
    • all accepted speech data are 100% phonemically
      transcribed at the initial-final level
  • Partitioning of collected data
    • collected data are partitioned properly
    • speech data and the transcriptions are organized
      on a per-speaker basis

15
Data Processing After Collection
  • Validation: identify successful recording sessions
  • Data Storage: collected telephone speech data
  • Transcription: accurate verbatim transcription
    for the speech data, e.g.
    /nei5-hou2-maa1/ /ngo5-hou2-hou2/ /nei5-ne1/
    /dou1-ng4-co3-laa1/
  • Distribution: printing CD-ROMs for distribution
  • Organization: organize data for easy access, e.g.
    \speaker01\data\001.wav, \002.wav, ...
    \speaker01\annotate\001.xsc, \002.xsc, ...

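
The example transcriptions on this slide can be split into base syllables and tones, which is the distinction the statistics slide draws between tonal and base syllables. This is a minimal sketch, assuming each syllable is lowercase letters followed by a single tone digit from 1 to 6 and that syllables are joined by hyphens inside slashes; the function name is hypothetical and the actual .xsc annotation format is not described in the slides.

```python
import re

# Assumed syllable shape: letters, then one tone digit 1-6.
SYLLABLE = re.compile(r"^([a-z]+)([1-6])$")


def parse_utterance(text):
    """Split '/nei5-hou2-maa1/' into [(base_syllable, tone), ...]."""
    syllables = []
    for token in text.strip("/").split("-"):
        match = SYLLABLE.match(token)
        if match is None:
            raise ValueError(f"unexpected syllable: {token!r}")
        syllables.append((match.group(1), int(match.group(2))))
    return syllables


utterances = [
    "/nei5-hou2-maa1/",
    "/ngo5-hou2-hou2/",
    "/nei5-ne1/",
    "/dou1-ng4-co3-laa1/",
]
parsed = [parse_utterance(u) for u in utterances]

total = sum(len(utt) for utt in parsed)               # syllable tokens
tonal = {syl for utt in parsed for syl in utt}        # distinct (base, tone)
base = {b for utt in parsed for b, _tone in utt}      # distinct base syllables
print(total, len(tonal), len(base))
```

Over a larger set of transcriptions the tonal inventory would normally be bigger than the base inventory, since one base syllable can occur with several tones.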
16
Part III: Data Analysis

17
Statistics of Reading Materials
Part        Per speaker          Tonal syl.   Base syl.   Syl. count
Phonetically oriented corpora
sent.       50 (out of 5719)     1399         579         4 to 31
para.       3 (out of 90)        768          418         23 to 120
Application-specific corpora
1-digit     10
7-digit     5
8-digit     5
16-digit    5
words       48 (out of 1440)     562          344         2 to 8

18
Frequency-of-frequency (FOF)
[Figure: frequency-of-frequency plots, one panel for the sentence materials and one for the paragraph materials]

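
A frequency-of-frequency table counts how many distinct units occur exactly once, exactly twice, and so on. The slides do not state which unit the plots use, so the sketch below assumes tonal syllables, and the function name and toy token list are hypothetical.

```python
from collections import Counter


def frequency_of_frequency(tokens):
    """Map each occurrence count to the number of distinct tokens
    that occur exactly that many times."""
    counts = Counter(tokens)         # token -> occurrences
    return Counter(counts.values())  # occurrences -> number of tokens


tokens = ["nei5", "hou2", "maa1", "ngo5", "hou2", "hou2", "nei5", "ne1"]
fof = frequency_of_frequency(tokens)
for occurrences in sorted(fof):
    print(occurrences, fof[occurrences])
```

A large value at an occurrence count of one means many units are seen only once, which is the kind of coverage gap such a plot makes visible.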
19
Part IV: Conclusions

20
Current Status
  • the collection process is divided into several
    stages
  • expected completion date: March 2002
  • so far, over 200 hours of data (from 1000
    speakers) have been collected
    • 120 hours of phonetically oriented data
    • 80 hours of application-specific data
  • over half of the collected data has been
    phonemically transcribed

21
Conclusions
  • the design and collection process for the Cantonese
    telephone speech corpora are presented
  • the corpora are designed to cover both phonetically
    oriented and application-specific data
  • they also include long reading materials and open
    questions for spontaneous data
  • details of post-processing and data analysis are
    given

22
Thank You