Problems and Prospects in Collecting Spoken Language Data

Slides: 23

Provided by: bibalexOr5

Learn more at: http://www.bibalex.org

Transcript and Presenter's Notes

1
Problems and Prospects in Collecting Spoken
Language Data

2
Outline

3
Need for Digital Library of Audio Video Data

Current and future data will be in audio and
video formats
Current technology makes it possible to digitize
and store such large amounts of data
Collection, storage and indexing of such data
make it possible to provide information to
current and future generations
Acts as a test bed for the research challenges
that exist in organizing, indexing and retrieving
such large data collections
Algorithms for quicker and easier access to the
information present in AV format by providing a
query using text / audio / video modes
Algorithms using multi-modal data for biometric
authentication
Development of multi-lingual speech synthesis and
speech recognition systems

4
Characteristics of Spoken Language Data

Message - Information to be conveyed
Speaker - Who is the speaker?
His/her background - Age, gender, literacy
levels, knowledge levels, mannerisms etc.
Emotions - Anger, sadness, happiness etc.
Idiolect - An individual's distinctive style of
speaking
Medium of transmission - Microphone, telephone,
satellite etc.
Environment - Party environment, airport/station
etc.
Language
Dialect - Grammar and vocabulary associated
with a regional or social use of a language
Culture and civilization - The richness of usage
of vocabulary, grammar etc. indicates the times
of the language and the society
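
The characteristics listed on this slide could be attached to each recording as a metadata record. The following is only a minimal sketch under stated assumptions: the class names, field names and the `matches` helper are hypothetical and do not come from the presentation, which does not describe an annotation schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeakerInfo:
    """Hypothetical speaker background fields (age, gender, literacy level)."""
    speaker_id: str
    age: Optional[int] = None
    gender: Optional[str] = None
    literacy_level: Optional[str] = None


@dataclass
class RecordingMetadata:
    """Hypothetical per-recording record covering the slide's characteristics."""
    message: str                       # information to be conveyed
    speaker: SpeakerInfo               # who is speaking, and their background
    language: str
    dialect: Optional[str] = None
    emotion: Optional[str] = None      # e.g. "anger", "sadness", "happiness"
    medium: Optional[str] = None       # e.g. "microphone", "telephone"
    environment: Optional[str] = None  # e.g. "party", "airport/station"


def matches(record: RecordingMetadata, **criteria: str) -> bool:
    """Return True if every given top-level field equals the requested value."""
    return all(getattr(record, name) == value for name, value in criteria.items())


# Illustrative record; all values are made up for the example.
example = RecordingMetadata(
    message="weather report",
    speaker=SpeakerInfo(speaker_id="spk001", age=34, gender="female"),
    language="Telugu",
    medium="telephone",
    environment="airport/station",
)
```

A call such as `matches(example, language="Telugu", medium="telephone")` then selects recordings by any combination of these characteristics.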

5
Characteristics of Spoken Language Data

How was a language spoken 25 years ago, 50 years
ago, 100 years ago and beyond?
How was a famous poem recited or sung by the
author?
How was a particular language spoken in different
geographical locations of a state/country?
How has a particular language/dialect evolved
over a period of time?
What were the rare languages/dialects (which are
no longer in existence)? How were they spoken?

6
Phase 0 - Prototype data collection at IIIT Hyd

7
Phase 0 - Prototype data collection at IIT Madras

8
Tools Aiding Acquisition/Correction of Speech
Data

9
Lessons Learnt

10
Proposal for Collection of Larger Spoken Language
Data for Indian Languages (IL)

Focus on information present in speech mode
Collect spoken language data from all Indian
languages and also from neighboring countries
Collect about 200,000 (0.2 M) hours of speech
As a part of Jim Baker's global project of
collecting 1 million hours of speech

11
New in our approach

Collection of large speech data, up to 200,000
(0.2 M) hours
All Indian languages and dialects
23 official Indian languages
Approx. 10,000 hours per language
All types - traditional, read, spoken,
conversational, dialog, movies, broadcast etc.
All modes - microphone, clean, telephone,
cellphone, satellite etc.
Standard procedure for organizing, annotating and
indexing
More focus on larger collection (and on
elimination rather than correction)
Make this data available for general public use
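
The figures on this slide can be cross-checked with a little arithmetic. The sketch below multiplies the stated 23 languages by approximately 10,000 hours each and compares the result with the 200,000-hour target. The storage estimate is an assumption added for illustration only: it supposes uncompressed 16 kHz, 16-bit, mono audio, a format the presentation does not specify.

```python
LANGUAGES = 23               # from the slide: 23 official Indian languages
HOURS_PER_LANGUAGE = 10_000  # from the slide: approx. 10,000 hours per language
TARGET_HOURS = 200_000       # from the slide: up to 200,000 (0.2 M) hours


def total_hours(languages: int, hours_per_language: int) -> int:
    """Total hours if every language receives the same allocation."""
    return languages * hours_per_language


def storage_terabytes(hours: float,
                      sample_rate_hz: int = 16_000,  # assumed, not from the slides
                      bytes_per_sample: int = 2,     # assumed 16-bit samples
                      channels: int = 1) -> float:   # assumed mono
    """Uncompressed PCM storage in terabytes (1 TB = 10**12 bytes)."""
    total_bytes = hours * 3600 * sample_rate_hz * bytes_per_sample * channels
    return total_bytes / 1e12


planned = total_hours(LANGUAGES, HOURS_PER_LANGUAGE)
print(planned)                          # 230000, close to the 200,000-hour target
print(storage_terabytes(TARGET_HOURS))  # about 23 TB under the assumed format
```

Under these assumptions the per-language allocation gives 230,000 hours, consistent with the slide's approximate 0.2 M figure, and the target collection would need roughly 23 TB before compression.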

12
Key Make-A-Difference Capability

Availability of information (Stories, lectures,
poems, books, articles) in spoken language
For the illiterate
For the vision impaired
Collection and storage of spoken language data of
popular as well as rare languages and dialects
Promotes research and development in
Speech Technology
Speech-to-speech translation in Indian languages
Phonetic engine (Language Independent)
Speech synthesis (Text-to-speech for Indian
languages)
Speaker recognition (Text independent and
dependent)
Language Identification
Speech enhancement
Speech signal processing
Biometrics
Multimodal Audio-Video modes
Information Access, Storage and Retrieval