Recent work on Language Identification

Author: Fabio Castaldo. Last modified by: laface.

1
Recent work on Language Identification
Pietro Laface, POLITECNICO di TORINO
Brno, 28-06-2009
2
Team
POLITECNICO di TORINO
  • Pietro Laface, Professor
  • Fabio Castaldo, Post-doc
  • Sandro Cumani, PhD student
  • Ivano Dalmasso, Thesis student
LOQUENDO
  • Claudio Vair, Senior Researcher
  • Daniele Colibro, Researcher
  • Emanuele Dalmasso, Post-doc
3
Outline
  • Acoustic models
  • Fast discriminative training of GMMs
  • Language factors
  • Phonetic models
  • 1-best tokenizers
  • lattice tokenizers
  • LRE09
  • Incremental acquisition of segments for the
    development sets

4
Our technology progress
  • Inter-speaker compensation in feature space
  • GLDS / SVM models (ICASSP 2007) - GMMs
  • SVM using GMM super-vectors (GMM-SVM)
  • Introduced by MIT-LL for speaker recognition
  • Fast discriminative training of GMMs
  • Alternative to MMIE
  • Exploiting the GMM-SVM separation hyperplanes
  • MIT discriminative GMMs
  • Language factors

5
Acoustic Language Identification
  • Task similar to text-independent Speaker
    Recognition

Gaussian Mixture Models (GMMs), MAP adapted from
a Universal Background Model (UBM)
[Figure labels: UBM, MAP]
6
GMM super-vectors
Concatenating the mean vectors of all Gaussians
into a single vector, we get a super-vector
We use GMM super-vectors
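As an illustration of the two steps just described (MAP adaptation from the UBM, then stacking the adapted means), here is a minimal Python sketch. It assumes diagonal covariances, adapts the means only, and uses the classical relevance-factor update; the function names and the relevance value of 16 are assumptions made for this example, not details taken from the slides.

```python
import math

def frame_posteriors(x, weights, means, variances):
    """Occupation probabilities of all UBM Gaussians for one frame (diagonal covariances)."""
    logs = []
    for w, mean, var in zip(weights, means, variances):
        log_density = -0.5 * sum(
            math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v
            for xi, mi, v in zip(x, mean, var)
        )
        logs.append(math.log(w) + log_density)
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Relevance-MAP adaptation of the UBM means (weights and variances are kept)."""
    n_gauss, dim = len(means), len(means[0])
    occupancy = [0.0] * n_gauss
    first_order = [[0.0] * dim for _ in range(n_gauss)]
    for x in frames:
        for m, gamma in enumerate(frame_posteriors(x, weights, means, variances)):
            occupancy[m] += gamma
            for d in range(dim):
                first_order[m][d] += gamma * x[d]
    # Interpolate between the data statistics and the UBM mean.
    return [
        [(first_order[m][d] + relevance * means[m][d]) / (occupancy[m] + relevance)
         for d in range(dim)]
        for m in range(n_gauss)
    ]

def supervector(means):
    """Concatenate the mean vectors of all Gaussians into one vector."""
    return [value for mean in means for value in mean]
```

With few frames the adapted means stay close to the UBM; with many frames they move toward the data, which is why short utterances give less reliable super-vectors.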
7
Using a UBM in LID
  • The frame based inter-speaker variation
    compensation approach estimates the inter-speaker
    compensation factors using the UBM
  • In the GMM-SVM approach all language GMMs share
    the same weights and variances of the UBM
  • The UBM is used for fast selection of Gaussians

8
Speaker/channel compensation in feature space
  • U is a low rank matrix (estimated offline)
    projecting the speaker/channel factors subspace
    in the supervector domain.
  • x(i) is a low dimensional vector, estimated using
    the UBM, holding the speaker/channel factors for
    the current utterance i.
  • γm(t) is the occupation probability of the
    m-th Gaussian
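A form of the compensation that is consistent with the three definitions above is sketched below. The symbols o(i)(t) for the feature frame at time t of utterance i, U_m for the rows of U that belong to the m-th Gaussian, and γ_m(t) for the occupation probability are notational assumptions made here.

```latex
\hat{o}^{(i)}(t) \;=\; o^{(i)}(t) \;-\; \sum_{m} \gamma_m(t)\, U_m\, x^{(i)}
```

Each frame is thus shifted by the nuisance offset of every Gaussian, weighted by how much that Gaussian is responsible for the frame.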

9
Estimating the U matrix
  • Estimating the U matrix from a large set of
    differences between models generated from
    different utterances of the same speaker
    compensates the distortions due to
    inter-session variability → Speaker recognition
  • Estimating the U matrix from a large set of
    differences between models generated from
    utterances of different speakers of the same
    language compensates the distortions due to
    inter-speaker/channel variability within the same
    language → Language recognition

10
GMM-SVM
  • A GMM model is trained for each utterance, both
    in train and in test
  • Each GMM is represented by a normalized GMM
    super-vector
  • The normalization is necessary to define a
    meaningful comparison between GMM supervectors

11
Kullback-Leibler divergence
  • Two GMMs (i and j) can be compared using an
    approximation of the Kullback-Leibler divergence

The interesting property of this measure is that
12
Kullback-Leibler normalization
  • normalizing each supervector component according
    to
  • The normalized UBM supervector defines the origin
    of a new space
  • The KL divergence becomes a Euclidean distance
  • The SVM language models are created using a
    linear kernel in this KL space
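The points above can be checked numerically with a small sketch. It assumes the normalization commonly used with GMM-SVM systems, in which each mean component is offset by the UBM mean and scaled by the square root of the Gaussian weight divided by the standard deviation; the exact formula shown on the slide is not available here, so this choice and all function names are assumptions.

```python
import math

def kl_normalize(means, ubm_means, weights, variances):
    """Map GMM means to the KL space: sqrt(w_m) * (mu - mu_ubm) / sigma, stacked."""
    return [
        math.sqrt(w) * (mu - mu0) / math.sqrt(var)
        for w, mean, mean0, variance in zip(weights, means, ubm_means, variances)
        for mu, mu0, var in zip(mean, mean0, variance)
    ]

def kl_approximation(means_i, means_j, weights, variances):
    """Approximate KL divergence between two GMMs sharing weights and variances."""
    return 0.5 * sum(
        w * (a - b) ** 2 / var
        for w, mean_i, mean_j, variance in zip(weights, means_i, means_j, variances)
        for a, b, var in zip(mean_i, mean_j, variance)
    )

def squared_distance(u, v):
    """Squared Euclidean distance between two stacked vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))
```

Under these assumptions the squared Euclidean distance between two normalized super-vectors equals twice the KL approximation, and the UBM itself maps to the zero vector, so a plain dot product in this space is a meaningful linear kernel.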

13
GMM-SVM weakness
  • GMM-SVM models perform very well with rather long
    test utterances

It is difficult to estimate a robust GMM with a
short test utterance
Exploit the discriminative information given by
the GMM-SVM for fast estimation of discriminative
GMMs
14
SVM discriminative directions
w: normal vector to the class-separation
hyperplane
15
GMM discriminative training
[Figure labels: Utterance GMM, Language GMM, KL space]
  • Shift each Gaussian of a language model along its
    discriminative direction, given by the vector
    normal to the class-separation hyperplane in the
    KL space

16
Rules for selection of αk
  • A discriminative GMM moves away from its original
    - MAP adapted - model, which best matches the
    training (and test) data.
  • A large value of αk (shift size) →
  • more discriminative model, but
  • worse likelihood than less discriminative models
  • Use a development set for estimating α
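The trade-off above can be illustrated with a short sketch. It assumes the language model is represented by its KL-normalized super-vector, that the SVM normal vector w is scaled to unit length before the shift, and that a single shift size is picked per model from a list of candidates; these choices, and the names `push_supervector` and `select_alpha`, are illustrative assumptions rather than the authors' exact procedure.

```python
import math

def push_supervector(language_sv, w, alpha):
    """Shift a KL-space super-vector by alpha along the unit-length direction of w."""
    norm = math.sqrt(sum(c * c for c in w))
    return [m + alpha * c / norm for m, c in zip(language_sv, w)]

def select_alpha(candidates, dev_error):
    """Return the candidate shift size with the lowest error on a development set.

    dev_error is a caller-supplied function (hypothetical here) that builds the
    shifted model for a given alpha and returns its development-set error.
    """
    return min(candidates, key=dev_error)
```

A value of zero leaves the MAP-adapted model unchanged; larger values trade likelihood for discrimination, which is why the development set is the arbiter.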

17
Experiments with 2048 GMMs
Pooled EER (%) of Discriminative 2048 GMMs, and
GMM-SVM on the NIST LRE tasks. In parentheses,
the average of the EERs of each language.
Year   Discriminative GMMs   Discriminative GMMs   GMM-SVM
       3s                    10s                   30s
1996   11.71 (13.71)         3.62 (4.92)           1.01 (1.37)
2003   13.56 (14.40)         5.50 (6.02)           1.42 (1.64)
2005   16.94 (17.85)         9.73 (11.07)          4.67 (5.81)
256-MMI (Brno University, IEEE Odyssey 2006)
2005   17.1                  8.6                   4.6
18
Pushed GMMs (MIT-LL)
19
Language Factors
  • Eigenvoice modeling, and the use of speaker
    factors as input features to SVMs, have recently
    been demonstrated to give good results for
    speaker recognition compared to the standard
    GMM-SVM approach (Dehak et al., ICASSP 2009).
  • Analogy
  • Estimate an eigen-language space, and use the
    language factors as input features to SVM
    classifiers (Castaldo et al. submitted to
    Interspeech 2009).

20
Language Factors advantages
  • Language factors are low-dimension vectors
  • Training and evaluating SVMs with different
    kernels is easy and fast: it requires the dot
    product of normalized language factors
  • Using a very large number of training examples is
    feasible
  • Small models give good performance

21
Toward an eigen-language space
  • After compensation of the nuisances of a GMM
    adapted from the UBM using a single utterance,
    residual information about the channel and the
    speaker remains.
  • However, most of the undesired variation is
    removed as demonstrated by the improvements
    obtained using this technique

22
Speaker compensated eigenvoices
  • First approach
  • Estimating the principal directions of the GMM
    supervectors of all the training segments before
    inter-speaker nuisance compensation would produce
    a set of language independent, universal
    eigenvoices.
  • After nuisance removal, however, the speaker
    contribution to the principal components is
    reduced to the benefit of language
    discrimination.

23
Eigen-language space
  • Second approach
  • Computing the differences between the GMM
    supervectors obtained from utterances of a
    polyglot speaker would compensate the speaker
    characteristics and would enhance the acoustic
    components of a language with respect to the
    others.
  • We do not have labeled databases including
    polyglot speakers
  • Instead, compute and collect the differences
    between GMM supervectors produced by utterances
    of speakers of two different languages,
    irrespective of the speaker identity, which is
    already compensated in the feature domain

24
Eigen-language space
  • The number of these differences would grow with
    the square of the number of utterances in the
    training set.
  • Perform Principal Component Analysis on the set
    of the differences between the set of the
    supervectors of a language and the average
    supervector of every other language.
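The procedure in the last bullet can be sketched as follows. The code builds, for every supervector, its difference from the average supervector of each other language, so the number of differences grows linearly with the number of utterances. The leading principal direction is then found by power iteration on the centered differences; a full implementation would extract many directions with an eigen-solver. All names, and the choice of power iteration, are assumptions made for this example.

```python
import math

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def language_differences(supervectors_by_language):
    """Differences between each supervector and the average of every other language."""
    averages = {lang: mean_vector(svs) for lang, svs in supervectors_by_language.items()}
    diffs = []
    for lang, svs in supervectors_by_language.items():
        for other, other_average in averages.items():
            if other == lang:
                continue
            for sv in svs:
                diffs.append([a - b for a, b in zip(sv, other_average)])
    return diffs

def principal_direction(diffs, iterations=50):
    """Leading PCA direction of the differences, via power iteration."""
    center = mean_vector(diffs)
    centered = [[a - c for a, c in zip(row, center)] for row in diffs]
    dim = len(center)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        # Multiply by the scatter matrix without forming it: y = D^T (D v).
        projections = [sum(a * b for a, b in zip(row, v)) for row in centered]
        y = [sum(p * row[k] for p, row in zip(projections, centered)) for k in range(dim)]
        norm = math.sqrt(sum(c * c for c in y))
        v = [c / norm for c in y]
    return v
```

Projecting a compensated supervector onto the leading directions found this way would give its language factors.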

25
Training corpora
  • The same corpora used for the LRE07 evaluation
  • All data of the 12 languages in the CALLFRIEND
    corpus
  • Half of the NIST LRE07 development corpus
  • Half of the OHSU corpus provided by NIST for
    LRE05
  • The Russian through Switched Telephone Network
    (RuSTeN) corpus
  • Automatic segmentation

26
Eigenvalues of two language subspaces
The language subspace has higher eigenvalues, and
both curves show a sharp decrease for their first
13 eigenvalues, corresponding to the main
language discrimination directions.
27
LRE07 30s closed set test
Language factors minDCF is always better and
more stable
28
Pushed GMMs (MIT-LL)
29
Pushed eigen-language GMMs
The same approach to obtain discriminative GMMs
from the language factors
30
Min DCFs and (EER)
Models                                     30s            10s            3s
GMM-SVM (KL kernel)                        0.029 (3.43)   0.085 (9.12)   0.201 (21.3)
GMM-SVM (Identity kernel)                  0.031 (3.72)   0.087 (9.51)   0.200 (21.0)
LF-SVM (KL kernel)                         0.026 (3.13)   0.083 (9.02)   0.186 (20.4)
LF-SVM (Identity kernel)                   0.026 (3.11)   0.083 (9.13)   0.187 (20.4)
Discriminative GMMs                        0.021 (2.56)   0.069 (7.49)   0.174 (18.45)
LF-Discriminative GMMs (KL kernel)         0.025 (2.97)   0.084 (9.04)   0.186 (19.9)
LF-Discriminative GMMs (Identity kernel)   0.025 (3.05)   0.084 (9.05)   0.186 (20.0)
31
Loquendo-Polito LRE09 System
Model Training
32
Phonetic models
  • Output layer
  • 700-1000 states for the language dependent
    phonetic units
  • Stationary units
  • 23 - 47
  • ASR Recognizer
  • phone-loop grammar with diphone transition
    constraints

33
Phone transcribers
  • ASR Recognizer
  • phone-loop grammar with diphone transition
    constraints
  • 12 phone transcribers for
  • French, German, Greek, Italian, Polish,
    Portuguese, Russian, Spanish, Swedish, Turkish,
    UK and US English.
  • The statistics of the n-gram phone occurrences
    are collected from the best decoded string of
    each conversation segment
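Collecting n-gram statistics from a 1-best phone string amounts to counting sliding windows over the decoded sequence. The sketch below assumes the transcriber output is already available as a list of phone labels; the function names and the maximum order of 3 are illustrative.

```python
from collections import Counter

def ngram_counts(phones, max_order=3):
    """Count all phone n-grams up to max_order in one decoded string."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for start in range(len(phones) - n + 1):
            counts[tuple(phones[start:start + n])] += 1
    return counts

def relative_frequencies(counts, order):
    """Relative frequencies of the n-grams of a given order."""
    selected = {gram: c for gram, c in counts.items() if len(gram) == order}
    total = sum(selected.values())
    return {gram: c / total for gram, c in selected.items()}
```

The relative frequencies, one vector per segment and transcriber, are the kind of statistics an SVM back-end can then consume.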

34
Phone transcribers
  • ANN models
  • Same phone-loop grammar - different engine
  • 10 phone transcribers for
  • Catalan, French, German, Greek, Italian, Polish,
    Portuguese, Russian, Spanish, Swedish, Turkish,
    UK and US English.
  • The statistics of the n-gram phone occurrences
    are collected from the expected counts from a
    lattice of each conversation segment
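Expected counts weight every n-gram by the posterior probability of the lattice paths that contain it. The sketch below makes this explicit on a toy lattice by enumerating all paths, which is only practical for very small graphs; real systems would use a forward-backward pass instead. The lattice encoding (a dictionary from node to outgoing arcs with a phone label and a weight) is an assumption of this example.

```python
from collections import defaultdict

def expected_ngram_counts(arcs, start, end, order=2):
    """Expected n-gram counts over an acyclic lattice, by brute-force path enumeration.

    arcs maps a node to a list of (next_node, phone, weight) tuples.
    """
    paths = []

    def walk(node, phones, weight):
        if node == end:
            paths.append((phones, weight))
            return
        for next_node, phone, arc_weight in arcs.get(node, []):
            walk(next_node, phones + [phone], weight * arc_weight)

    walk(start, [], 1.0)
    total = sum(weight for _, weight in paths)
    expected = defaultdict(float)
    for phones, weight in paths:
        posterior = weight / total  # path posterior
        for i in range(len(phones) - order + 1):
            expected[tuple(phones[i:i + order])] += posterior
    return dict(expected)
```

Unlike 1-best counts, these values are fractional, so competing phone hypotheses each contribute in proportion to their posterior.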

35
Multigrams
  • Two different TFLLR kernels
  • trigrams
  • pruned multigrams
  • multigrams can provide useful information about
    the language by capturing word parts within the
    string sequences
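For reference, the TFLLR kernel scales each n-gram relative frequency by the inverse square root of its frequency in the whole training set, so that rare but informative n-grams are not swamped by frequent ones. The sketch below assumes n-gram frequencies stored in dictionaries; function and parameter names are illustrative.

```python
import math

def tfllr_features(utterance_freqs, background_freqs):
    """Scale each n-gram frequency by 1 / sqrt(background frequency)."""
    return {
        gram: p / math.sqrt(background_freqs[gram])
        for gram, p in utterance_freqs.items()
        if gram in background_freqs
    }

def tfllr_kernel(freqs_a, freqs_b, background_freqs):
    """Linear kernel between two utterances after TFLLR scaling."""
    fa = tfllr_features(freqs_a, background_freqs)
    fb = tfllr_features(freqs_b, background_freqs)
    return sum(value * fb.get(gram, 0.0) for gram, value in fa.items())
```

Because the scaling is applied to the features, a standard linear SVM can be trained on the scaled vectors directly.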

36
Pruned Multigrams
For each phonetic transcriber, we discard all the
n-grams that appear in the training set fewer times
than 0.05 of the average occurrence of the unigrams
Total number of n-grams for 12 language transcribers
N-gram   1     2       3        4        5       6
Pruned   461   11477   120200   114396   10738   443
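The pruning rule can be written in a few lines. The sketch assumes n-gram counts for one transcriber are stored in a dictionary keyed by phone tuples, and it takes the 0.05 ratio from the text as a default parameter; the function name is hypothetical.

```python
def prune_ngrams(counts, ratio=0.05):
    """Keep n-grams whose training count reaches ratio * average unigram count."""
    unigram_counts = [c for gram, c in counts.items() if len(gram) == 1]
    threshold = ratio * sum(unigram_counts) / len(unigram_counts)
    return {gram: c for gram, c in counts.items() if c >= threshold}
```

Since the threshold is tied to the unigram statistics of each transcriber, the same ratio adapts to phone inventories of different sizes.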
37
Scoring
  • The total number of models that we use for
    scoring an unknown segment is 34
  • 11 channel dependent models (11 x 2)
  • 12 single channel models (2 telephone and 10
    broadcast models only).
  • 23 x 2 for MMIE GMMs (channel independent but
    M/F)

38
Calibration and fusion
Multi-class FoCal
max of the channel dependent scores
39
Language pair recognition
  • For the language-pair evaluation only the
    back-ends have been re-trained; the models of
    all the sub-systems have been kept unchanged.

40
Telephone development corpora
  • CALLFRIEND - Conversations split into slices of
    150s
  • NIST 2003 and NIST 2005
  • LRE07 development corpus
  • Cantonese and Portuguese data in the 22 Language
    OGI corpus
  • RuSTeN - The Russian through Switched Telephone
    Network corpus

41
Broadcast development corpora
  • Incrementally created to include as far as
    possible the variability within a language due to
    channel, gender and speaker differences
  • The development data, further split in training,
    calibration and test subsets, should cover the
    mentioned variability

42
Problems with LRE09 dev data
  • Often same speaker segments
  • Scarcity of segments for some languages after
    filtering same speaker segments
  • Genders are not balanced
  • Excluding French, the segments of all languages
    are either telephone or broadcast.
  • No audited data available for Hindi, Russian,
    Spanish and Urdu on VOA3; only automatic
    segmentation was provided
  • No segmentation was provided in the first release
    of the development data for Cantonese, Korean,
    Mandarin, and Vietnamese
  • For these 8 missing languages only the language
    hypotheses provided by BUT were available for
    VOA2 data.

43
Additional audited data
  • For the 8 languages lacking broadcast data,
    segments have been generated accessing the VOA
    site looking for the original MP3 files
  • Goal: collect 300 broadcast segments per
    language, processed to detect narrowband
    fragments
  • The candidates were checked to eliminate segments
    including music, bad channel distortions, and
    fragments of other languages

44
Development data for bootstrap models
  • The segments were distributed to these sets so
    that same speaker segments were included in the
    same set.
  • A set of acoustic (pushed GMMs) bootstrap models
    has been trained

45
Additional not-audited data from VOA3
  • Preliminary tests with the bootstrap models
    indicated the need for additional data
  • Selected from VOA3 to include new speakers in the
    train, calibration and test sets
  • assuming that the file label correctly identifies
    the corresponding language

46
Speaker selection
  • Performed by means of a speaker recognizer
  • We process the audited segments before the others
  • A new speaker model is added to the current set
    of speaker models whenever the best recognition
    score obtained by a segment is less than a
    threshold
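The incremental speaker selection can be sketched as a greedy loop. The example assumes each segment is represented by a fixed-length vector and uses cosine similarity as a stand-in for the speaker recognizer score; both the representation and the scoring function are assumptions, as is the requirement that the caller lists audited segments first.

```python
import math

def cosine_score(model, segment):
    """Stand-in for a speaker recognition score (hypothetical)."""
    dot = sum(a * b for a, b in zip(model, segment))
    norm = math.sqrt(sum(a * a for a in model)) * math.sqrt(sum(b * b for b in segment))
    return dot / norm

def select_speakers(segments, threshold, score=cosine_score):
    """Add a new speaker model whenever the best score of a segment is below threshold.

    Segments should be ordered with the audited ones first.
    """
    models, assignments = [], []
    for segment in segments:
        scores = [score(model, segment) for model in models]
        best = max(scores, default=float("-inf"))
        if best < threshold:
            models.append(segment)  # the segment founds a new speaker model
            assignments.append(len(models) - 1)
        else:
            assignments.append(scores.index(best))
    return models, assignments
```

The resulting assignments make it possible to keep all segments of one speaker in the same subset.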

47
Additional not-audited data from VOA2
  • Enriching the training set
  • Language recognition has been performed using a
    system combining the acoustic bootstrap models
    and a phonetic system
  • A segment has been selected only if the 1-best
    language hypothesis of our system
  • had an associated score greater than a given
    (rather high) threshold
  • matched the 1-best hypothesis provided by the BUT
    system
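The two selection conditions combine into a simple filter. In this sketch the per-language scores of the combined system are assumed to be available as a dictionary, and the BUT hypothesis as a language label; the function name and the return convention are illustrative.

```python
def select_segment(own_scores, but_hypothesis, threshold):
    """Return the agreed language label, or None if the segment is rejected."""
    best_language = max(own_scores, key=own_scores.get)
    confident = own_scores[best_language] > threshold
    agrees = best_language == but_hypothesis
    return best_language if confident and agrees else None
```

Requiring both a high score and agreement between two independent systems reduces the risk of adding mislabeled segments to the training set.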

48
Total number of segments for this evaluation
                 Corpora
Set              voa3_A   voa2_A   ftp_C   voa3_S   voa2_S   ftp_S
Train            529      116      316     1955     590      66
Extended train   114      22       65      2483     574      151
Development      396      85       329     1866     449      45
  • Suffix
  • A: audited
  • C: checked
  • S: automatic segmentation
  • ftp: ftp://8475.ftp.storage.akadns.net/mp3/voa/

49
Hausa - Decision Cost Function
[Figure: DCF]
50
Hindi - Decision Cost Function
[Figure: DCF]
51
Results on the development set
Test on                 Pushed GMMs   MMIE GMMs   3-grams   Multi-grams   Lattice   Fusion
Broadcast + telephone   1.48          1.70        1.09      1.12          1.06      0.86
Broadcast subset        1.54          1.69        1.24      1.26          1.14      0.91
Telephone subset        2.00          2.51        1.45      1.49          1.42      1.21
Average minDCF x 100 on 30s test segments
52
Korean - score cumulative distribution
[Figure legend: t-t, t-b, b-t]