
Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition

Thibault Bañeras Roux, Mickael Rouvier, Jane Wottawa, Richard Dufour

To cite this version:
Thibault Bañeras Roux, Mickael Rouvier, Jane Wottawa, Richard Dufour. Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition. Interspeech, Sep 2022, Incheon, South Korea. hal-03712735.

HAL Id: hal-03712735
https://hal.archives-ouvertes.fr/hal-03712735
Submitted on 4 Jul 2022

Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition

Thibault Bañeras Roux¹, Mickaël Rouvier², Jane Wottawa³, Richard Dufour¹
¹ LS2N - Nantes University (France)
² LIA - Avignon University (France)
³ LIUM - Le Mans University (France)
[email protected], [email protected], [email protected], [email protected]

Abstract

Evaluating automatic speech recognition (ASR) systems is a classical but difficult and still open problem, which often boils down to focusing only on the word error rate (WER). However, this metric suffers from many limitations and does not allow an in-depth analysis of automatic transcription errors. In this paper, we propose to study and understand the impact of rescoring using language models in ASR systems by means of several metrics often used in other natural language processing (NLP) tasks in addition to the WER. In particular, we introduce two measures related to morpho-syntactic and semantic aspects of transcribed words: 1) the POSER (Part-of-speech Error Rate), which should highlight the grammatical aspects, and 2) the EmbER (Embedding Error Rate), a measurement that modifies the WER by providing a weighting according to the semantic distance of the wrongly transcribed words. These metrics illustrate the linguistic contributions of the language models that are applied during a posterior rescoring step on transcription hypotheses.

Index Terms: Automatic speech recognition, Semantic analysis, Language modeling, evaluation metrics

1. Introduction

Over the last years, various speech and language processing fields have made significant progress thanks to scientific and technological advances. Automatic Speech Recognition (ASR) has notably benefited from the massive increase in available data and the use of deep learning approaches [1, 2], making its models more robust and efficient [3]. From an application point of view, several usage contexts are possible: an automatic transcription can either be used directly (e.g. for automatic subtitling), or it can be part (often as an input) of another application (e.g. human-computer dialogue, automatic indexing of audio documents, etc.). Despite the current performance, errors in automatic transcriptions are inevitable and impact their use: for example, ASR errors can affect the applications in which these systems are implemented, and thus negatively influence their global performance, by making it difficult for humans to understand the transcriptions.

ASR systems are widely evaluated with the Word Error Rate (WER) metric. The simplicity of this metric is its main advantage and the reason for its massive adoption, as it only requires a reference transcription (i.e. manually annotated) of the words. It is nevertheless limited in the sense that no information other than the word itself is integrated (e.g. no linguistic information is taken into account, no semantic knowledge, etc.). Each error also has the same weight within this metric, even though we know that words have a different impact depending on the targeted task [4]. These limitations have already been exposed in the past, with proposed variants such as the IWER [5], which focuses on words chosen as important within a transcription.

In this paper, we investigate a set of automatic measures used in various natural language processing (NLP) tasks to help in the specific evaluation of ASR systems, especially on language-related aspects. These measures should allow for a finer analysis of transcription errors, by highlighting certain forms of errors (part-of-speech, context errors, semantic distance, etc.). One of the advantages of these proposed measures is that they do not require any additional manual annotation of transcriptions and can be applied to any language. Moreover, using several of them puts forward different views of the errors, so these metrics can complement each other. We then propose a qualitative analysis using these metrics on a state-of-the-art ASR system, by analyzing in more detail the contribution of a posteriori reordering of transcription hypotheses, a process called rescoring, performed with a quadrigram language model (LM) coupled to a Recurrent Neural Net Language Model (RNNLM) on a French dataset.

This paper is organized as follows: in Section 2, we describe the classical WER metric, before listing and detailing the different automatic measures we propose to allow a finer evaluation of transcriptions at a linguistic level. In order to understand the interest of these measures, a qualitative analysis of language model rescoring is proposed, first detailing the experimental protocol in Section 3, then the results and analysis in Section 4. Finally, a conclusion as well as perspectives are provided in Section 5.

2. Description of proposed measures

ASR systems are mainly evaluated through the WER. In this section, we first describe it (Section 2.1) in order to highlight its advantages and limitations. Then we detail the six complementary automatic measures that we wish to apply to the evaluation of automatic transcriptions at the syntactic (Sections 2.2, 2.3 and 2.4) and semantic (Sections 2.5, 2.6 and 2.7) levels in addition to the WER.

2.1. Word Error Rate (WER)

This metric compares a reference (manual) transcription with an automatic transcription obtained with an ASR system at the word level, a word being a chain of characters between two blanks. The WER then simply takes into account three types of errors: substitutions (S), insertions (I) and deletions (D).
• Substitution (S): in a given word chain, a transcribed word differs from the reference word.

• Insertion (I): in a given word chain, a transcribed word was inserted with respect to the reference. The hypothesis counts one word more than the reference.

• Deletion (D): in a given word chain, a word of the reference was not transcribed. The hypothesis counts one word less than the reference.

The following example illustrates an alignment between a reference sentence (Reference) and an automatic transcription (Hypothesis) allowing the calculation of the WER, where S, D, I and = respectively mark substitutions, deletions, insertions and correctly transcribed words:

    Reference:   How   are   you   -     today   Patrick
                 S     D     =     I     =       S
    Hypothesis:  Were  -     you   here  today   playing

Formally, the WER is calculated as follows:

    WER = (#S + #I + #D) / #reference words        (1)

By definition, the WER therefore considers any type of error of equivalent importance. This is the main advantage of this metric: its simplicity of application and use. However, the WER does have limitations. Using the previous example, the word Patrick was transcribed as playing. An alternative transcription hypothesis could have been Patricia. In both cases, the WER with respect to the reference would be identical, even though the nature of the error is different (Patricia belongs to the same grammatical category, while playing differs from the reference word in terms of both syntax and semantics). Another limitation concerns the few categories considered (substitution, insertion, deletion) for the rate calculation, which carry no additional information about the context.
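To make Equation (1) concrete, here is a minimal sketch (not taken from the paper) that aligns the two word sequences with a standard edit-distance dynamic program and divides the resulting error count by the reference length; the function name and printed value are purely illustrative.

```python
# Minimal WER sketch: edit-distance alignment over words, then Equation (1).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimal number of edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# 0.8 -> 4 errors (2 S, 1 D, 1 I) over 5 reference words, as in the example above
print(wer("How are you today Patrick", "Were you here today playing"))
```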

2.2. Character Error Rate (CER)

The character error rate (CER) is based on the same principle as the WER but applied to character chains instead of word chains. It has already been used in the ASR domain [6]. It is particularly suitable for character-based languages such as Chinese or Japanese. For Latin languages, and in particular French, the CER allows, among other things, to give an indication of the nature of the errors: a low CER could indicate that the ASR system tends to generate words close to the reference (and thus potentially incorporating errors related to gender, number, tense, etc.), as opposed to a high CER, with transcription hypotheses that are very distant from the references.

2.3. Part-of-speech Error Rate (POSER)

We also chose to use a metric allowing the calculation of the error rate on the part-of-speech (POS) classes of a transcription (POSER, for Part-of-speech Error Rate). POSER allows us to know whether the transcribed sentences are grammatically close to the reference ones, and to better characterize substitution errors. This rate is calculated with the same formula as the WER, except that POS tags, which are metadata of the transcribed words, are taken into account instead of the words themselves.

2.4. Lemma Error Rate (LER)

With a concept similar to the POSER and the WER, we also define a Lemma Error Rate, which consists in calculating the error rate over lemmas. We use two versions of this metric: one computing the WER and one computing the CER between the lemmas of the reference and the lemmas of the hypothesis (referred to as LER and LCER respectively in the tables).
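As an illustration of how POSER and LER can reuse the same machinery as the WER, the following sketch tags and lemmatizes both transcripts and feeds the resulting POS and lemma sequences to the wer() function from the previous sketch. This is our own illustration under stated assumptions: the authors use the POET tagger and the Spacy lemmatizer (Section 3.3), whereas this sketch uses spaCy for both, and the model name and example sentences are ours.

```python
# Sketch: POS Error Rate (POSER) and Lemma Error Rate (LER) as a "WER" over
# POS tags / lemmas instead of surface words. spaCy is an illustrative
# stand-in for the tools used in the paper.
# Requires: python -m spacy download fr_core_news_md
import spacy

nlp = spacy.load("fr_core_news_md")

def pos_and_lemmas(sentence: str):
    doc = nlp(sentence)
    return [t.pos_ for t in doc], [t.lemma_ for t in doc]

ref_pos, ref_lem = pos_and_lemmas("comment allez-vous aujourd'hui Patrick")
hyp_pos, hyp_lem = pos_and_lemmas("comment allez-vous aujourd'hui Patricia")

# Reuse the word-level wer() defined earlier on the tag / lemma sequences.
# Patrick -> Patricia keeps the POS class (proper noun), so the POS-level
# rate should stay lower than the lemma-level rate.
poser = wer(" ".join(ref_pos), " ".join(hyp_pos))
ler = wer(" ".join(ref_lem), " ".join(hyp_lem))
print(f"uPOSER={poser:.2f}  LER={ler:.2f}")
```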

2.5. Embedding Error Rate (EmbER)

As previously exposed, the semantic aspect of a transcription is not taken into account in the WER metric. To address this, we consider a metric based on lexical word embeddings. Unlike existing metrics based on word embeddings, we aim at keeping the WER but weighting it: a word is no longer considered in a binary way (0 for a good transcript and 1 for an error); errors are instead weighted according to their semantic distance from the reference word. This distance is computed using the cosine similarity between the embeddings of the reference word and of the substituted transcribed word.
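A minimal sketch of such a weighted WER follows (our illustration, not the authors' implementation): the alignment is the same as for the WER, but each substitution is charged according to the cosine similarity between the embeddings of the two words, obtained from an arbitrary embed callable (e.g. fastText vectors); the thresholded cost actually used in the experiments is described in Section 3.3.

```python
# Sketch of EmbER: WER where each substitution is weighted by the semantic
# distance between the reference word and the substituted word.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def ember(ref_words, hyp_words, embed, sub_cost):
    """embed: word -> vector; sub_cost: cosine similarity -> cost in [0, 1]."""
    n, m = len(ref_words), len(hyp_words)
    d = np.zeros((n + 1, m + 1))
    d[:, 0] = np.arange(n + 1)           # deletions keep a cost of 1
    d[0, :] = np.arange(m + 1)           # insertions keep a cost of 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref_words[i - 1] == hyp_words[j - 1]:
                sub = 0.0
            else:
                sub = sub_cost(cosine(embed(ref_words[i - 1]),
                                      embed(hyp_words[j - 1])))
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + sub)
    return d[n, m] / n

# Thresholded substitution cost reported in Section 3.3:
# 0.1 when the two embeddings are close enough, 1 otherwise.
paper_cost = lambda sim: 0.1 if sim > 0.4 else 1.0

# Example usage (assuming a loaded fastText model):
# ember(ref.split(), hyp.split(), ft_model.get_word_vector, paper_cost)
```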
2.6. BERTScore

Developed for text generation [7], this metric aims at comparing a reference and a hypothesis with respect to semantic proximity. The first step consists in obtaining the words and sub-words (tokens) of the reference and the hypothesis thanks to the WordPiece tokenizer used by BERT [8].

Then, given the sequences of contextualized embeddings of the reference (x1, ..., xk) and of the hypothesis (x̂1, ..., x̂m), the cosine similarity is computed between each pair of reference and hypothesis embeddings to obtain a score matrix, weighted here with inverse document frequency [7].

To compute the recall, we associate each reference token x with the hypothesis token x̂ bringing the highest similarity; the precision is computed by associating each hypothesis token x̂ with a reference token x in the same way. The F-measure, which we use in our experiments, is computed from the recall and the precision [7].
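In practice, BERTScore can be computed with the reference implementation released by its authors (the bert_score package, see footnote 4); the short usage sketch below is ours, and the choice of the language flag is an assumption.

```python
# Sketch: BERTScore F1 between references and hypotheses,
# using the authors' implementation (pip install bert-score).
from bert_score import score

refs = ["comment allez-vous aujourd'hui Patrick"]
hyps = ["comment allez-vous aujourd'hui Patricia"]

# lang="fr" lets the library pick a multilingual backbone for French.
P, R, F1 = score(hyps, refs, lang="fr")
print(float(F1.mean()))
```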
2.7. Sentence Semantic Distance (SemDist)

While the previous metrics focus on words and characters, the principle of this metric [9] is to consider the complete sentence. In the ASR framework, the reference and the hypothesis are respectively transformed into their sentence embeddings using a SentenceBERT [10] model, i.e. a model of sentence embeddings built on the contextual word embeddings of BERT [8]. It is then possible to compare these vectors with the cosine similarity. Our final measure is the average of the cosine similarities between each reference's sentence embedding and that of its respective hypothesis.
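A minimal SemDist sketch with the sentence-transformers library is given below; it is our own illustration, and the multilingual checkpoint name is an assumption, not the one reported by the authors.

```python
# Sketch of SemDist: cosine similarity between SentenceBERT embeddings of the
# reference and the hypothesis, averaged over the test set.
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed multilingual checkpoint; any SentenceBERT model covering French works.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semdist(refs, hyps):
    ref_emb = model.encode(refs)
    hyp_emb = model.encode(hyps)
    sims = [np.dot(r, h) / (np.linalg.norm(r) * np.linalg.norm(h))
            for r, h in zip(ref_emb, hyp_emb)]
    return float(np.mean(sims))
```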
3. Experimental protocol

In this section, we present the experimental protocol set up to apply the different metrics listed in Section 2. We describe the data used for our qualitative analysis of language model rescoring in Section 3.1, and the ASR system and the POS tagger in Sections 3.2 and 3.3 respectively. Finally, we present the embeddings used by the different metrics and the lemmatizer.

3.1. Data

The French datasets used to train the ASR system are ESTER 1 and 2 [11, 12], EPAC [13], ETAPE [14], REPERE [15] and internal LIA data. Taken together, the corpora represent approximately 940 hours of audio from radio and television broadcasts. The evaluation of the systems is done on the REPERE test corpus, which contains about 10 hours of audio data.
WER CER LER LCER dPOSER uPOSER EmbER SemDist BERTScore
WER
CER 89.34
LER 88.08 88.49
LCER 87.10 98.31 91.40
dPOSER 92.96 90.02 92.70 89.51
uPOSER 90.40 90.58 93.69 90.81 97.95
EmbER 96.51 91.51 86.57 88.78 91.00 88.98
SemDist 71.81 64.78 62.22 62.60 65.33 64.13 75.73
BERTScore 74.63 74.27 72.60 73.00 74.09 74.25 84.51 63.35
Table 1: Averages of the Pearson correlations between the proposed metrics from both Base and Rescoring systems. For readability
reasons, the values are multiplied by 100.

3.2. Automatic Speech Recognition (ASR) system

The ASR system is based on an existing state-of-the-art recipe¹ that uses the Kaldi [16] toolkit. The acoustic model is a deep neural network based on the TDNNF [17] architecture. To make the system more robust to different acoustic conditions, the audio files were randomly perturbed in speed and volume (i.e. data augmentation) during the training process.

Three language models are used. The first is a trigram model trained with SRILM [18] and used directly by the ASR system. The second is a RNNLM, a deep neural network based language model, used in an a posteriori rescoring process. The network consists of three TDNN layers interspersed with two LSTM layers. In addition, a quadrigram model is used during the rescoring step. The training corpus and the vocabulary used to learn the trigram model, the RNNLM model and the quadrigram model are identical. The rescoring is optional, as we want to observe its impact on the different metrics.

3.3. Tools

We used the POET tool², a POS tagger for the French language based on Flair [19] contextual embeddings, to automatically extract the morpho-syntactic information from words. We chose this labeler because it provides both the generic classes of Universal Dependencies (noun, adjective, adverb, etc.) and a finer granularity thanks to additional information on these same labels (feminine plural noun, third person plural personal pronoun, etc.). We then propose two measures based on POS tags derived from the POSER (Section 2.3): one integrating the detailed classes (dPOSER) and one with the generic classes of Universal Dependencies (uPOSER). Note that no manual POS tag annotation was used: both reference and hypothesis transcripts were automatically tagged. To obtain the lemmas, we used the Spacy lemmatizer for French³.

For the EmbER metric (Section 2.5), we used Fasttext embeddings [20] and applied an error weight of 0.1 if the cosine similarity is above a threshold of 0.4, and 1 in the other cases. The threshold was decided empirically, given the cosine similarity between synonyms compared to the cosine similarity between randomly chosen words.

For the SemDist metric (Section 2.7), multilingual SentenceBERT embeddings were used. Finally, for the BERTScore, we use the default multilingual-BERT base model⁴.

¹ https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/s5/
² https://huggingface.co/qanastek/pos-french
³ https://github.com/explosion/spacy-models/releases/tag/fr_core_news_md-3.2.0
⁴ https://github.com/Tiiiger/bert_score

4. Experiments and Analysis

This section presents, first, an analysis of the six applied metrics presented in Section 2 in addition to the WER, and second, a qualitative study of the impact of the language model rescoring process used in our ASR system.

4.1. Metrics analysis

In order to make a more in-depth analysis of our metrics, in particular to understand and estimate the links that they can maintain between them, we calculated a Pearson correlation between our different measurements for our two systems and averaged them in Table 1. The higher the score between two metrics, the more they are considered correlated. Clearly, the first remark is that not all metrics correlate with each other in the same way. SemDist is the metric that correlates the least with the others. This might be due to the fact that it is the only metric based on sentence embeddings in our experiments, going beyond the word dimension. This weak correlation implies that minimizing the WER would not correlate strongly with better performance on downstream tasks (i.e. extrinsic evaluation) using sentence embeddings. This idea is consistent with many publications in NLP and ASR that consider intrinsic ratings to be less relevant than extrinsic ratings [21, 22]. Indeed, the authors of SemDist [9] concluded that their metric correlated better with downstream tasks than the WER.

We can see that the metric that correlates best with BERTScore and SemDist is EmbER, all three of which are based on embeddings, while the metric that correlates best with EmbER is the WER. This highlights that the Embedding Error Rate is a hybrid metric that has the advantage of correlating both with the WER and with embedding-based metrics.

An interesting observation is that the LER correlates best with uPOSER and has a better correlation with dPOSER than the LCER does. It seems that part-of-speech tags and lemmas share some similarity: if the lemma is wrong, the POS is often wrong. Also, the LCER and the CER have a correlation of 0.9831, which probably means that when the CER is high, there is a good chance that the word is wrong too, and so is the lemma. On the other hand, it also means that the LCER does not bring more information than the CER.
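To reproduce the kind of correlation matrix shown in Table 1, per-utterance scores can be collected for each metric and correlated pairwise; the sketch below is ours, with illustrative metric names and values.

```python
# Sketch: pairwise Pearson correlations between per-utterance metric scores,
# as used to build Table 1 (values here are made up for illustration).
from itertools import combinations
from scipy.stats import pearsonr

# scores[metric] = list of per-utterance values for one ASR system
scores = {
    "WER":     [0.20, 0.00, 0.35, 0.10],
    "EmbER":   [0.15, 0.00, 0.30, 0.08],
    "SemDist": [0.95, 1.00, 0.80, 0.97],
}

for a, b in combinations(scores, 2):
    r, _ = pearsonr(scores[a], scores[b])
    print(f"{a:>8} vs {b:<8} r = {r:.2f}")
```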
System WER CER dPOSER uPOSER LER LCER SemDist BERTScore EmbER
Base 15.45 8.57 14.59 12.22 14.35 8.78 7.89 9.12 12.33
Rescoring 13.24 7.70 12.51 10.79 12.08 8.00 7.18 8.38 10.79
Reduction -14.3 % -10.2 % -14.3 % -11.7 % -15.8 % -8.8 % -9.0 % -8.1 % -12.5 %
Table 2: Performance comparison of the Base and Rescoring systems using different metrics. The observed reduction between the two
systems, in relative value, is also provided.

4.2. Rescoring Impact

In order to improve the performance of the ASR system, rescoring was achieved using a RNNLM, a deep neural network based language model.

Table 2 presents the results obtained with the different metrics applied to the automatic transcriptions from the ASR system without (Base) and with hypothesis reordering (Rescoring). As expected, rescoring improves the results, with a decrease of error rates independently of the metric used: an improvement is thus visible at the level of words, characters, syntax and semantics. The relative gains for each metric are also provided in Table 2 (for instance, the WER drops from 15.45 to 13.24, i.e. a relative reduction of 14.3%). They mainly highlight the fact that the relative gain obtained on the WER is the highest compared to the other metrics. Depending on the purpose of the system, the quality of a transcription can be defined by its grammatical, lexical or semantic similarity with the reference. We therefore suspect that the benefits obtained thanks to this rescoring step are not as significant as what the WER suggests. In comparison, the SemDist and BERTScore metrics show the lowest relative gains, which tends to indicate that rescoring only partially corrects transcribed words that were semantically far from their reference. The proposed EmbER, which is a mixed measure between the WER and embeddings, seems to take both the syntactic and the semantic level into account, with a gain between that of the WER and that of the embedding measurements. Overall, language model rescoring contributes less to the improvement of the semantic level (SemDist, BERTScore and EmbER) than of the syntactic level, visible through a larger reduction on the character, POS and lemma based measures.

Thanks to the meta information annotated in the REPERE corpus, we could observe that the rescoring process deteriorates performance on utterances of spontaneous speech. On average, utterances presenting more errors after the rescoring step contained 1.23 times more spontaneity markers (elisions, reductions, truncations and other disfluencies).

This is in line with the hypothesis we made: speech with too many disfluencies (and thus a mismatch between linguistic training and testing conditions) might be negatively impacted by rescoring.

With respect to POS, we propose in Table 3 to measure the average cosine distance between every reference word and its associated hypothesis word. We computed this distance without (Base) and with rescoring, while providing the reduction for each POS. This highlights that interjections (INTJ) and coordinating conjunctions (CCONJ), and to a lesser extent verbs (VERB) and adjectives (ADJ), are the word categories that benefit the most from rescoring, while numbers (NUM) and determiners (DET) are among the POS classes that benefit the least from this additional step. The reason for the improvement of the interjections is probably that this POS is the one with the highest error rate.

POS     Base    Rescoring  Reduction
INTJ    14.07   10.45      -3.63
CCONJ    9.83    6.82      -3.01
VERB     6.10    4.20      -1.90
ADJ      5.08    3.41      -1.67
AUX      4.66    3.27      -1.39
PRON     5.37    4.12      -1.25
SCONJ    3.51    2.43      -1.08
PROPN    6.72    5.82      -0.90
NOUN     3.34    2.57      -0.77
ADV      3.23    2.49      -0.74
ADP      2.90    2.25      -0.65
DET      2.95    2.42      -0.53
NUM      2.96    2.62      -0.34

Table 3: Average semantic distance per POS between each word from the reference and its associated word from the hypothesis. For readability, the values are multiplied by 100.
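As a sketch of how a table like Table 3 can be assembled (our illustration, with assumed helper inputs): for each aligned reference/hypothesis word pair, compute one minus the cosine similarity of their word embeddings, attach the POS tag of the reference word, and average per tag.

```python
# Sketch: average semantic distance per POS class (as in Table 3).
# pairs = [(ref_word, hyp_word, ref_pos), ...] is assumed to come from the
# word alignment and the POS tagger; embed(word) returns a word vector
# (e.g. fastText), as in the earlier EmbER sketch.
from collections import defaultdict
import numpy as np

def per_pos_distance(pairs, embed):
    buckets = defaultdict(list)
    for ref_w, hyp_w, pos in pairs:
        u, v = embed(ref_w), embed(hyp_w)
        dist = 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
        buckets[pos].append(dist)
    # mean distance per POS, multiplied by 100 for readability as in Table 3
    return {pos: 100 * float(np.mean(d)) for pos, d in sorted(buckets.items())}
```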
5. Conclusions and Perspectives

In this study, we applied different measures, in addition to the WER metric, to ASR systems in order to reveal different linguistic dimensions (grammatical, semantic, etc.) of transcription errors.

We have chosen to verify their relevance by studying the impact of a posteriori hypothesis reordering on ASR systems using language models. Our study showed that the gains are not equivalent depending on the metric considered, thus highlighting the limitations of the WER alone to study improvements at the lexical, grammatical or semantic level. It is important to note that the rescoring improves overall performance, though the increase in performance is not always visible locally.

In the continuity of this work, we would like to extend this analysis by combining the measures. Indeed, we have been interested here in these metrics independently, but it seems relevant to study, for example, semantic measures on identified POS (e.g., compare BERTScore on personal names and adjectives). Also, this study focuses on the linguistic aspect of ASR, while we observed that segments with strong cues of speech spontaneity may be negatively impacted by the rescoring process. It would then be interesting to continue this study at the acoustic level, by looking in particular into other audio factors such as noise or speech overlap. In the longer term, it would be interesting to evaluate the correlation between our metrics and human perception of errors.

6. Acknowledgments

This work was financially supported by the DIETS project financed by the Agence Nationale de la Recherche (ANR) under contract ANR-20-CE23-0005.
7. References

[1] L. Deng, G. Hinton, and B. Kingsbury, "New types of deep neural network learning for speech recognition and related applications: An overview," in IEEE International Conference On Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 8599–8603.

[2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep speech 2: End-to-end speech recognition in english and mandarin," in International conference on machine learning. PMLR, 2016, pp. 173–182.

[3] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.

[4] M. Morchid, R. Dufour, and G. Linarès, "Impact of word error rate on theme identification task of highly imperfect human–human conversations," Computer Speech & Language, vol. 38, pp. 68–85, 2016.

[5] S. Mdhaffar, Y. Estève, N. Hernandez, A. Laurent, R. Dufour, and S. Quiniou, "Qualitative evaluation of asr adaptation in a lecture context: Application to the pastel corpus," in InterSpeech, 2019, pp. 569–573.

[6] M. Xu, S. Li, and X.-L. Zhang, "Transformer-based end-to-end speech recognition with local dense synthesizer attention," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5899–5903.

[7] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "Bertscore: Evaluating text generation with bert," arXiv preprint arXiv:1904.09675, 2019.

[8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.

[9] S. Kim, D. Le, W. Zheng, T. Singh, A. Arora, X. Zhai, C. Fuegen, O. Kalinli, and M. L. Seltzer, "Evaluating user perception of speech recognition system quality with semantic distance metric," arXiv preprint arXiv:2110.05376, 2021.

[10] N. Reimers and I. Gurevych, "Sentence-bert: Sentence embeddings using siamese bert-networks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2019. [Online]. Available: http://arxiv.org/abs/1908.10084

[11] S. Galliano, E. Geoffrois, G. Gravier, J.-F. Bonastre, D. Mostefa, and K. Choukri, "Corpus description of the ester evaluation campaign for the rich transcription of french broadcast news," in International Conference on Language Resources and Evaluation (LREC), 2006, pp. 139–142.

[12] S. Galliano, G. Gravier, and L. Chaubard, "The ester 2 evaluation campaign for the rich transcription of french radio broadcasts," in Tenth Annual Conference of the International Speech Communication Association, 2009.

[13] Y. Esteve, T. Bazillon, J.-Y. Antoine, F. Béchet, and J. Farinas, "The epac corpus: manual and automatic annotations of conversational speech in french broadcast news," in International Conference on Language Resources and Evaluation (LREC), 2010.

[14] G. Gravier, G. Adda, N. Paulsson, M. Carré, A. Giraudel, and O. Galibert, "The etape corpus for the evaluation of speech-based tv content processing in the french language," in International Conference on Language Resources and Evaluation (LREC), 2012, pp. 114–118.

[15] A. Giraudel, M. Carré, V. Mapelli, J. Kahn, O. Galibert, and L. Quintard, "The repere corpus: a multimodal corpus for person recognition," in International Conference on Language Resources and Evaluation (LREC), 2012, pp. 1102–1107.

[16] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The kaldi speech recognition toolkit," in IEEE workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE Signal Processing Society, 2011.

[17] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, "Semi-orthogonal low-rank matrix factorization for deep neural networks," in Interspeech, 2018, pp. 3743–3747.

[18] A. Stolcke, "Srilm - an extensible language modeling toolkit," in Seventh international conference on spoken language processing, 2002.

[19] A. Akbik, D. Blythe, and R. Vollgraf, "Contextual string embeddings for sequence labeling," in Proceedings of the 27th international conference on computational linguistics, 2018, pp. 1638–1649.

[20] P. Bojanowski, É. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.

[21] Y.-Y. Wang, A. Acero, and C. Chelba, "Is word error rate a good indicator for spoken language understanding accuracy," in IEEE workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2003, pp. 577–582.

[22] G. Glavaš, R. Litschko, S. Ruder, and I. Vulić, "How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 710–721.
