Abdelkader El Mahdaouy - Information Access in Large Arabic Text Collections

09:00

Saturday

Dec

2017

Thesis defence

Organized by:

Abdelkader El Mahdaouy

Speaker:

Abdelkader El Mahdaouy

Defence venue:

Conference room of the Faculté des Sciences Dhar El Mahraz, Fès, Morocco

Jury:

Mr. Mohand Boughanem, professor, Université Toulouse 3 - CNRS-IRIT, reviewer
Mr. Pierre Zweigenbaum, research director, Université Paris-Saclay LIMSI-CNRS, reviewer
Mr. Mohammed Ouçamah Cherkaoui Malki, professor, Faculté des Sciences Dhar El Mahraz, Fès, examiner
Mr. Brahim Ouhbi, professor, Ecole Nationale Supérieure d'Arts et Métiers, Meknès, examiner
Mr. Eric Gaussier, professor, Université Grenoble Alpes, Grenoble, thesis supervisor
Mr. Saïd Ouatik El Alaoui, professor, Faculté des Sciences Dhar El Mahraz, Fès, thesis supervisor

Given the amount of Arabic textual information available on the web, developing effective Information Retrieval Systems (IRSs) has become essential for retrieving relevant information. Most current Arabic IRSs are based on the bag-of-words representation, where documents are indexed using surface words, roots, or stems. Two main drawbacks of this representation are the ambiguity of Single Word Terms (SWTs) and term mismatch.
The aim of this work is to address SWT ambiguity and term mismatch. Accordingly, we propose four contributions to improve Arabic content representation, indexing, and retrieval. The first contribution consists of representing Arabic documents using Multi-Word Terms (MWTs). This choice is motivated by the fact that MWTs are more precise representational units and less ambiguous than isolated SWTs. Hence, we propose a hybrid method to extract Arabic MWTs, which combines linguistic and statistical filtering of MWT candidates. The linguistic filter uses part-of-speech (POS) tagging to identify MWT candidates that fit a set of syntactic patterns, and it handles the problem of MWT variation. The statistical filter then ranks MWT candidates using our proposed association measure, which combines contextual information with both termhood and unithood measures. In the second contribution, we explore and evaluate several IR models for ranking documents using both SWTs and MWTs. Additionally, we investigate a wide range of proximity-based IR models for Arabic IR. We then introduce a formal condition that IR models should satisfy to deal adequately with term dependencies. The third contribution consists of a method for Arabic IR based on distributed representations of words, namely Word Embeddings (WE). It relies on incorporating WE semantic similarities into existing probabilistic IR models in order to deal with term mismatch. The aim is to allow distinct but semantically similar terms to contribute to document scores. The last contribution is a method to incorporate WE similarity into Pseudo-Relevance Feedback (PRF) for Arabic Information Retrieval. The main idea is to select expansion terms using their distribution in the set of top pseudo-relevant documents along with their similarity to the original query terms.
The experimental validation of all the proposed contributions is performed using the standard Arabic TREC 2001/2002 collection.
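
To make the last contribution more concrete, here is a minimal sketch of selecting expansion terms from pseudo-relevant documents using embedding similarity. It is an illustration only, not the method of the thesis: the function name `select_expansion_terms`, the toy two-dimensional embeddings, and the scoring rule (relative frequency in the top documents multiplied by mean cosine similarity to the query terms) are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_expansion_terms(query_terms, top_docs, embeddings, k=2):
    """Rank candidate expansion terms found in the pseudo-relevant documents.

    Hypothetical scoring: (relative frequency in top_docs) multiplied by
    (mean cosine similarity to the query terms). The thesis may combine
    these two signals differently.
    """
    counts = Counter(term for doc in top_docs for term in doc)
    total = sum(counts.values())
    scores = {}
    for term, count in counts.items():
        # Skip original query terms and terms without an embedding.
        if term in query_terms or term not in embeddings:
            continue
        sims = [cosine(embeddings[term], embeddings[q])
                for q in query_terms if q in embeddings]
        if not sims:
            continue
        scores[term] = (count / total) * (sum(sims) / len(sims))
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (illustrative only; real embeddings would be trained on a corpus).
embeddings = {
    "car": [1.0, 0.0],
    "vehicle": [0.9, 0.1],
    "engine": [0.8, 0.3],
    "banana": [0.0, 1.0],
}
top_docs = [
    ["car", "vehicle", "engine", "banana"],
    ["vehicle", "engine", "vehicle"],
]
expansion = select_expansion_terms(["car"], top_docs, embeddings, k=2)
```

In this toy setting, "banana" occurs in a top-ranked document but has zero similarity to the query term, so it receives a score of zero and is not selected. This is the intuition behind combining term distribution with embedding similarity.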