Problems and Prospects in Collecting Spoken Language Data

Slides: 23
Provided by: bibalexOr5
Learn more at: http://www.bibalex.org

1
Problems and Prospects in Collecting Spoken
Language Data
  • Kishore Prahallad
  • Suryakanth V Gangashetty
  • B. Yegnanarayana
  • Raj Reddy
  • IIIT Hyderabad, India
  • Carnegie Mellon University, USA.

2
Outline
  • Need for a digital library of audio and video data
  • Characteristics of spoken language data
  • Prototype data collection
      • IIIT Hyderabad
      • IIT Madras
  • Lessons learnt
  • Proposal to collect Indian language (IL) data
      • As a part of Jim Baker's global project

3
Need for Digital Library of Audio Video Data
  • Current and future data will be in audio and
    video formats
  • Current technology makes it possible to digitize
    and store such large amounts of data
  • Collection, storage and indexing of such data
    make it possible to provide information to
    current and future generations
  • Acts as a test bed for the research challenges
    that exist in organizing, indexing and retrieving
    such large data collections
  • Algorithms for quicker and easier access to the
    information present in AV format by providing a
    query using text / audio / video modes
  • Algorithms using multi-modal data for biometric
    authentication
  • Development of multi-lingual speech synthesis and
    speech recognition systems

4
Characteristics of Spoken Language Data
  • Message: information to be conveyed
  • Speaker: who is the speaker?
      • His/her background: age, gender, literacy
        levels, knowledge levels, mannerisms etc.
      • Emotions: anger, sadness, happiness etc.
      • Idiolect: an individual's distinctive style of
        speaking
  • Medium of transmission: microphone, telephone,
    satellite etc.
  • Environment: party environment, airport/station
  • Language
      • Dialect: the grammar and vocabulary associated
        with a regional or social use of a language
      • Culture and civilization: the richness of usage
        of vocabulary, grammar etc. indicates the times
        of the language and the society

5
Characteristics of Spoken Language Data
  • How was a language spoken 25 years ago, 50 years
    ago, 100 years ago and beyond?
  • How was a famous poem recited or sung by its
    author?
  • How was a particular language spoken in different
    geographical locations of a state/country?
  • How has a particular language/dialect evolved
    over a period of time?
  • Which rare languages/dialects are no longer in
    existence, and how were they spoken?

6
Phase 0: Prototype data collection at IIIT Hyderabad
  • High-quality studio recordings
      • 2 hrs of single-speaker recordings for speech
        synthesis
      • Telugu, Hindi, Tamil and Indian English
      • Developed text-to-speech systems in these 4
        languages
  • Telephone and cell-phone corpus
      • 150 hrs (540 speakers)
      • Telugu, Tamil and Marathi
      • Developed speech recognition systems in these 3
        languages

7
Phase 0: Prototype data collection at IIT Madras
  • 15 hours (72 speakers)
  • TV news in Tamil, Telugu and Hindi
  • Text-to-speech (TTS) systems
  • Language identification
  • Duration modeling for TTS systems

8
Tools Aiding Acquisition/Correction of Speech
Data
  • Transcription correction tool (TCT)
      • Spoken errors at phone, syllable, word level
      • Background noise, abrupt beginning or end, low SNR
      • TCT corrects the above errors at three levels
  • Audio Video Transcription Tool
      • Used to annotate movie databases
  • Correction of segment labels
      • Emulabel

9
Lessons Learnt
  • Speech correction takes 3-6 times more effort than
    collection
      • Better to collect more data than to correct it
  • Needs a unified framework
      • Standardize processes, procedures and tools
  • Need a larger collection of spoken and text corpora
      • For building practical speech systems in Indian
        languages

10
Proposal for collection of larger Spoken Language
Data for IL
  • Focus on information present in speech mode
  • Collect spoken language data from all Indian
    languages and also from neighboring countries
  • Collect about 200,000 (0.2 M) hours of speech
      • As a part of Jim Baker's global project of
        collecting 1 million hours of speech

11
New in our approach
  • Collection of large speech data, up to 200,000
    (0.2 M) hours
  • All Indian languages and dialects
      • 23 official Indian languages
      • Approx. 10,000 hours per language
  • All types: traditional, read, spoken,
    conversational, dialog, movies, broadcast etc.
  • All modes: microphone, clean, telephone,
    cellphone, satellite etc.
  • Standard procedure for organizing, annotating and
    indexing
  • More focus on larger collection (and on
    elimination rather than correction)
  • Make this data available for general public use

12
Key Make-A-Difference Capability
  • Availability of information (stories, lectures,
    poems, books, articles) in spoken language
      • For the illiterate
      • For the vision impaired
  • Collection and storage of spoken language data of
    popular as well as rare languages and dialects
  • Promotes research and development in:
  • Speech Technology
  • Speech-to-speech translation in Indian languages
  • Phonetic engine (Language Independent)
  • Speech synthesis (Text-to-speech for Indian
    languages)
  • Speaker recognition (Text independent and
    dependent)
  • Language Identification
  • Speech enhancement
  • Speech signal processing
  • Biometrics
  • Multimodal Audio-Video modes
  • Information Access, Storage and Retrieval

13
Implementation Plan
  • Phase 1 (3.5 months)
      • 10 languages
      • 33,300 hours
  • Phase 2 (8 months)
      • 10 languages (those of Phase 1)
      • 66,000 hours
  • Phase 3 (10 months)
      • 13 remaining languages
      • 80,000 hours
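
The phase figures above can be tallied with a short script. This is only a sketch: it assumes the three phases run one after another (so their durations add up) and takes the 200,000-hour target from the proposal slide; the function and variable names are illustrative, not part of the talk.

```python
# Phase figures as listed on the slide: (name, months, languages, hours).
PHASES = [
    ("Phase 1", 3.5, 10, 33_300),
    ("Phase 2", 8.0, 10, 66_000),
    ("Phase 3", 10.0, 13, 80_000),
]
TARGET_HOURS = 200_000  # overall goal stated in the proposal


def summarize(phases):
    """Return total hours, total months and hours per language per phase."""
    total_hours = sum(hours for _, _, _, hours in phases)
    # Assumption: phases are sequential, so the months simply add up.
    total_months = sum(months for _, months, _, _ in phases)
    per_language = {name: hours / langs for name, _, langs, hours in phases}
    return total_hours, total_months, per_language


total_hours, total_months, per_language = summarize(PHASES)
# Phases 1 and 2 cover the same 10 languages, so their hours accumulate.
first_ten_hours = per_language["Phase 1"] + per_language["Phase 2"]

print("planned hours:", total_hours)             # 179300
print("shortfall vs target:", TARGET_HOURS - total_hours)
print("total months:", total_months)             # 21.5
print("hours per first-ten language:", first_ten_hours)  # 9930.0
```

Under these assumptions the first ten languages reach about 9,930 hours each, in line with the "approx. 10,000 hours per language" figure, while the three phase totals add up to 179,300 hours rather than the full 200,000.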

14
Mid-Term and Final Terms
  • Mid-term
      • Phase 1: collection of 33,300 hours of speech
      • Collection, storage and indexing of speech data
        for public information access
      • Visible research output using the speech data
      • Demonstrations of speech technology products
      • Speech recognition in 10 languages
  • Final term
      • Phase 1 + Phase 2

15
Q & A

16
Misc.

17
Impact of Audio Digital Library
  • Availability of information in spoken language
    form for the illiterate and others
  • Promotes research in speech technology for Indian
    languages
  • Enables development of speech technology products
    useful for the common man
  • Examples
      • Speech-to-speech translation systems: for
        information exchange
      • Screen readers: for the illiterate and
        physically challenged
      • Naturally speaking dialog systems: for
        information access over voice mode

18
Phase 1 Time Estimate
  • Phase 1
      • 10 official Indian languages
      • Parallel collection of data
      • 3000 hours per language
      • 5,000 - 10,000 speakers
      • > 10 min of speech per speaker
      • Total: 33,300 hours
  • Time estimate (3.5 months for all 10 languages)
      • 10-person team per language
      • Each person works 8 hours a day
      • 30 mins of speech recording per hour
      • 1-3 speakers per hour
      • 240 mins of speech per day per person
      • 8-24 speakers per day per person
      • 240 speakers per day per team
      • 20,000 speakers per language in 84 working days
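
The time estimate on this slide can be reproduced with a few multiplications. The sketch below uses only the numbers listed above and takes the upper bound of 3 speakers per hour, which is what the 240-speakers-per-day figure implies; the function name and the returned keys are hypothetical.

```python
def phase1_throughput(team_size=10, hours_per_day=8,
                      recording_min_per_hour=30, max_speakers_per_hour=3,
                      working_days=84, languages=10):
    """Recompute the Phase 1 collection estimate from the slide's inputs."""
    # One person records 30 minutes of speech in each of 8 working hours.
    minutes_per_person_day = hours_per_day * recording_min_per_hour
    # Upper bound: 3 speakers per hour, 8 hours a day, 10 people per team.
    speakers_per_team_day = team_size * hours_per_day * max_speakers_per_hour
    speakers_per_language = speakers_per_team_day * working_days
    # Convert recorded minutes to hours for one language team.
    hours_per_language = team_size * minutes_per_person_day * working_days / 60
    return {
        "minutes_per_person_day": minutes_per_person_day,
        "speakers_per_team_day": speakers_per_team_day,
        "speakers_per_language": speakers_per_language,
        "hours_per_language": hours_per_language,
        "total_hours": hours_per_language * languages,
    }


estimate = phase1_throughput()
for key, value in estimate.items():
    print(f"{key}: {value}")
```

With these inputs one team records 20,160 speakers and 3,360 hours per language in 84 working days, or 33,600 hours across 10 languages, close to the rounded 20,000 speakers and 33,300 hours quoted on the slides.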

19
Phase 1 Cost Estimate
  • Manpower cost: Rs 140 lakhs
  • Equipment cost: Rs 55 lakhs
  • Communication cost: Rs 40 lakhs
  • Contingency (10%): Rs 25 lakhs
  • Total cost: Rs 2.6 crores (approx. USD 565,000)
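
As a quick check, the cost heads above can be summed in a few lines. The sketch uses the standard conversions 1 lakh = 100,000 and 1 crore = 10,000,000 rupees. Reading the 565,000 figure as US dollars is an assumption, and the variable names are illustrative.

```python
LAKH = 100_000
CRORE = 10_000_000

# Phase 1 cost heads from the slide, in lakhs of rupees.
costs_lakhs = {
    "manpower": 140,
    "equipment": 55,
    "communication": 40,
    "contingency": 25,
}

total_rs = sum(costs_lakhs.values()) * LAKH
total_crores = total_rs / CRORE

# Contingency as a share of the three direct cost heads.
direct_lakhs = sum(v for k, v in costs_lakhs.items() if k != "contingency")
contingency_share = costs_lakhs["contingency"] / direct_lakhs

# Assumption: the 565,000 on the slide is a US dollar amount.
implied_rs_per_usd = total_rs / 565_000

print(f"total: Rs {total_rs:,} = {total_crores} crores")
print(f"contingency share: {contingency_share:.1%}")
print(f"implied exchange rate: Rs {implied_rs_per_usd:.2f} per USD")
```

The heads add up to Rs 2.6 crores as stated. The contingency works out to a little over 10% of the direct costs, and the dollar reading implies a rate of roughly Rs 46 per dollar.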

20
Man-Power Cost
  • Data collection team: Rs 86 lakhs
      • 10 (for data collection) x Rs 10 K per month
      • 10 (for data correction) x Rs 10 K per month
      • 1 data manager (Rs 15 K per month)
      • 4 months' cost: Rs 8,60,000 per language
  • 5 engineers: Rs 4 lakhs
      • B.Tech level (Rs 20,000 per month)
  • Gifts for speakers: Rs 50 lakhs
      • Rs 25 per speaker
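
The manpower breakdown can be worked through as follows. This is a sketch under two assumptions that the slide does not state outright: the engineers are also paid for 4 months, and the gift budget covers 200,000 speakers (20,000 per language across 10 languages, from the time estimate). All names are hypothetical.

```python
LAKH = 100_000


def manpower_cost(languages=10, months=4,
                  collectors=10, correctors=10, staff_salary=10_000,
                  manager_salary=15_000,
                  engineers=5, engineer_salary=20_000,
                  speakers=200_000, gift_per_speaker=25):
    """Return the manpower cost heads in rupees."""
    # Monthly salaries for one language team: collectors, correctors, manager.
    team_monthly = (collectors + correctors) * staff_salary + manager_salary
    team_per_language = team_monthly * months
    team_total = team_per_language * languages
    # Assumption: engineers work for the same 4 months as the teams.
    engineers_total = engineers * engineer_salary * months
    # Assumption: 20,000 speakers per language x 10 languages.
    gifts_total = speakers * gift_per_speaker
    return {
        "team_per_language": team_per_language,
        "team_total": team_total,
        "engineers_total": engineers_total,
        "gifts_total": gifts_total,
        "total": team_total + engineers_total + gifts_total,
    }


cost = manpower_cost()
for head, rupees in cost.items():
    print(f"{head}: Rs {rupees / LAKH:g} lakhs")
```

Under these assumptions the three heads come to Rs 86, 4 and 50 lakhs, matching the slide and summing to the Rs 140 lakhs manpower figure in the Phase 1 cost estimate.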

21
Machines Cost
  • Machines
  • 30 servers: Rs 30 lakhs
      • 3 servers per language
      • Each server has 4 ports for data collection
  • 30 CTI cards: Rs 20 lakhs
  • Storage (20 TB): Rs 5 lakhs
      • Two copies of 20 TB

22
Communications Cost
  • Telephone charges: Rs 20 lakhs
      • Rs 1 per min (local telephone charges)
  • Transportation: Rs 20 lakhs
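
The telephone figure is consistent with billing every recorded minute at the local rate. The sketch below assumes that all 33,300 Phase 1 hours are collected over the phone, which the slide does not say explicitly; the function name is illustrative.

```python
LAKH = 100_000


def telephone_cost(speech_hours=33_300, rupees_per_minute=1):
    """Cost of recording the given hours of speech over local calls."""
    # Assumption: every recorded minute is a billed call minute.
    return speech_hours * 60 * rupees_per_minute


phone_rs = telephone_cost()
transport_rs = 20 * LAKH  # transportation figure as given on the slide
communication_rs = phone_rs + transport_rs

print(f"telephone: Rs {phone_rs / LAKH:.2f} lakhs")
print(f"communication total: Rs {communication_rs / LAKH:.2f} lakhs")
```

This gives Rs 19.98 lakhs for calls and Rs 39.98 lakhs overall, which rounds to the Rs 20 lakhs and Rs 40 lakhs quoted in the cost estimate.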