Zied Elloumi - Performance prediction of automatic speech recognition systems

Zied Elloumi



  • Laurent Besacier, Professor, Université Grenoble Alpes (Supervisor)
  • Jean-Francois Bonastre, Professor, Université d'Avignon (Reviewer)
  • Denis Jouvet, Professor, Université de Lorraine (Reviewer)
  • Julien Pinquier, Associate Professor (HDR), IRIT (Examiner)
  • Olivier Galibert, Research Engineer, LNE (Co-supervisor)
  • Benjamin Lecouteux, Associate Professor, Université Grenoble Alpes (Co-supervisor)


The presentation will be given in French. 


In this thesis, we focus on performance prediction of automatic speech recognition (ASR) systems. This task is useful for measuring the reliability of transcription hypotheses on a new data collection when the reference transcription is unavailable and the ASR system used is unknown (black box). Our contribution covers several areas. First, we propose a heterogeneous French corpus to train and evaluate ASR performance prediction systems. We then compare two prediction approaches: a state-of-the-art (SOTA) performance prediction method based on engineered features and a new strategy based on features learnt by convolutional neural networks (CNNs). While the joint use of textual and signal features did not work for the SOTA system, combining both inputs in the CNNs leads to the best word error rate (WER) prediction performance. We also show that our CNN-based predictor predicts the shape of the WER distribution over a collection of speech recordings remarkably well. Then, we analyze factors impacting both prediction approaches. We also assess the impact of the training set size of the prediction systems, as well as the robustness of systems trained on the outputs of a particular ASR system and used to predict performance on a new data collection. Our experimental results show that both prediction approaches are robust and that the prediction task is more difficult on short speech turns and on spontaneous speech. Finally, we try to understand which information is captured by our neural model and how it relates to different factors. Our experiments show that intermediate representations in the network automatically encode information about the speech style, the speaker's accent, and the broadcast program type. To take advantage of this analysis, we propose a multi-task system that is slightly more effective on the performance prediction task.