Houssein Ahmed Assowe - Building and evaluating a bilingual corpus for MT: application to French-Somali

Organized by: 
Houssein Ahmed Assowe


Jury:

  • Hervé Blanchon, associate professor, Université Grenoble Alpes, thesis supervisor
  • Mathieu Lafourcade, associate professor, Université de Montpellier, reviewer
  • Max Silberztein, professor, Université de Franche-Comté, reviewer
  • Christophe Roche, professor, Université Savoie Mont Blanc, examiner
  • Christian Boitet, professor emeritus, Université Grenoble Alpes, invited member

As part of ongoing work to computerize a large number of "poorly endowed" languages, especially those in the French-speaking world, we have created a French-Somali machine translation (MT) system dedicated to a journalistic sub-language. It produces quality translations for the Somali-speaking, non-French-speaking populations of the Horn of Africa, and relies on a bilingual corpus built by post-editing Google Translate (GT) results. For this, we have created the very first quality French-Somali parallel corpus, comprising to date 98,912 words (about 400 standard pages) and 10,669 segments. It is an aligned corpus of very good quality, because we built it by post-editing pre-translations produced by GT, which combines its French-English and English-Somali MT language pairs. The corpus was also evaluated by 9 bilingual annotators, who assigned a quality score to each segment and corrected our post-editing. Using this growing corpus as training data, we have built several successive versions of MosesLIG-fr-so, a statistical phrase-based machine translation (PBMT) system, which has proven to be better than GT on this language pair and this sub-language, in terms of both BLEU and post-editing time. We also used OpenNMT to build and experiment with a first French-Somali neural MT system, in order to improve MT results without leading to prohibitive computation times, both during training and during decoding.
In addition, we have set up an iMAG (interactive multilingual access gateway) that allows non-French-speaking Somali web users on the continent to access the online edition of the newspaper "La Nation de Djibouti" in Somali. The segments (sentences or titles), automatically pre-translated by any available fr-so MT system, can be post-edited and rated (on a scale of 1 to 20) by the readers themselves, so as to improve the system by incremental learning, in the same way as was done before for the French-Chinese PBMT system created by [Wang, 2015].