Marwa Hadj Salah - Arabic word sense disambiguation for and by machine translation

08:00

Tuesday

Dec

2018

Thesis defence

Speaker:

Marwa Hadj Salah

Teams:

GETALP

Keywords:

Word sense disambiguation
Machine Translation
Annotation transfer
Corpus enrichment

Venue:

Institut d'administration des entreprises de Grenoble (IAE)
525 Avenue Centrale, 38400 Saint-Martin-d'Hères
Room 120

Jury:

Hervé Blanchon, associate professor, Université Grenoble Alpes, thesis supervisor
Mounir Zrigui, professor, Université de Monastir - Tunisia, thesis supervisor
Didier Schwab, associate professor, Université Grenoble Alpes, examiner
Patrick Paroubek, research engineer, CNRS Île-de-France Gif-sur-Yvette, reviewer
Mohamed Jemni, professor, Université de Tunis - Tunisia, reviewer
Kamel Smaili, professor, Université de Lorraine, examiner

This thesis studies Word Sense Disambiguation (WSD), a central task in natural language processing that can improve applications such as machine translation and information extraction. Research in word sense disambiguation predominantly concerns English, because most other languages lack a standard lexical reference for annotating corpora, and also lack sense-annotated corpora for evaluating and, more importantly, for building word sense disambiguation systems. For English, the lexical database WordNet is a long-standing de facto standard used in most sense-annotated corpora and in most WSD evaluation campaigns.
Our contributions in this thesis cover several areas:
First, we present a method for automatically creating sense-annotated corpora for any language, by taking advantage of the large amount of English corpora sense-annotated with WordNet and by using a machine translation system. This method is applied to Arabic and is evaluated on what is, to our knowledge, the only Arabic corpus manually sense-annotated with WordNet: the Arabic OntoNotes 5.0, which we have semi-automatically enriched.
The evaluation of this method relies on an implementation of two supervised word sense disambiguation systems trained on the corpora produced with it. We thus propose a solid baseline for evaluating future Arabic word sense disambiguation systems, together with sense-annotated Arabic corpora that we provide as a freely available resource.
Second, we propose an in vivo evaluation of our Arabic word sense disambiguation system by measuring its contribution to the performance of a machine translation task.
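
The idea of annotation transfer mentioned in the keywords and in the first contribution can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the abstract does not say how sense tags are carried over from the English text to its translation, so the sketch assumes that a word alignment between the two sentences is available and that only one-to-one links are trusted. The function name `project_senses`, the sense labels, and the example alignment are hypothetical placeholders, not real WordNet sense keys or data from the thesis.

```python
from collections import Counter
from typing import List, Optional, Tuple


def project_senses(
    source_senses: List[Optional[str]],
    alignment: List[Tuple[int, int]],
    target_length: int,
) -> List[Optional[str]]:
    """Copy sense tags from source tokens to aligned target tokens.

    Assumption (not from the thesis): a tag is copied only when the
    source token and the target token are each aligned exactly once.
    """
    source_links = Counter(src for src, _ in alignment)
    target_links = Counter(tgt for _, tgt in alignment)

    target_senses: List[Optional[str]] = [None] * target_length
    for src, tgt in alignment:
        sense = source_senses[src]
        one_to_one = source_links[src] == 1 and target_links[tgt] == 1
        if sense is not None and one_to_one:
            target_senses[tgt] = sense
    return target_senses


# English source: "the bank approved the loan" (5 tokens).
# The labels below are illustrative placeholders, not WordNet sense keys.
english_senses = [None, "bank.institution", "approve.accept", None, "loan.money"]

# Hypothetical alignment to a 4-token Arabic translation,
# given as (source index, target index) pairs.
# Source token 2 is linked to two target tokens, so its tag is not copied.
links = [(1, 1), (2, 0), (2, 2), (4, 3)]

arabic_senses = project_senses(english_senses, links, target_length=4)
print(arabic_senses)
```

Restricting the transfer to one-to-one links is a conservative choice that trades coverage for precision; a real pipeline would also need tokenization and alignment steps that are outside the scope of this sketch.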